Summary: as title. When we have two kernels with the same name, the stack traces should be appended, not overwritten.
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D80472731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160905
Approved by: https://github.com/angelayi
Currently we only check that fabric allocation succeeded, but sometimes we then fail during export or import, with no recourse. Check the full cycle before attempting to allocate memory with the fabric.
TODO: move it to c10/cuda so that it can be used from CUDACachingAllocator too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160790
Approved by: https://github.com/Skylion007
1. Add require_exact_world_size()
2. Decorate the test `test_new_subgroups_with_group_param` with `require_exact_world_size(4)`, as the test would fail with a world_size of 8 when testing on the 8xB200 runner.
3. Modify `test_new_subgroups_world_size_not_divisible_by_group_size` so that it will not fail due to the 4 vs. 8 mismatch. Doing so makes the test pass with both the 4-GPU and 8-GPU runners.
Separating these changes out from B200 distributed runner PR #159323
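A minimal sketch of what such a decorator could look like; the actual helper added to the distributed test utilities may differ in name and details:
```python
import functools
import unittest


def require_exact_world_size(n):
    """Skip the wrapped test unless the runner has exactly `n` ranks."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.world_size != n:
                raise unittest.SkipTest(f"requires world_size == {n}, got {self.world_size}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```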
Fixes https://github.com/pytorch/pytorch/issues/159987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160803
Approved by: https://github.com/fduwjj
Before this change, there was the requirements file `.ci/docker/requirements-docs.txt` which was symlinked as `../.ci/docker/requirements-docs.txt` from `docs/requirements.txt` since #151796.
In this situation, [because `.ci` is excluded from the source tarball](3173616532/.github/workflows/create_release.yml (L67)), we end up with a broken symlink, that additionally is [invalid in a Python source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/#unpacking-without-the-data-filter).
The broken symlink can be confirmed in [the rc sources](https://github.com/pytorch/pytorch/actions/runs/15892205745).
~After this change, there is still a single source of truth, which now is `docs/requirements.txt`, symlinked as `../docs/requirements.txt` from `.ci/docker/requirements-docs.txt`, which would also be invalid in a Python source distribution, but is not included in the tarball (see above). Additionally, the docs requirements that were missing from the previous tarball, are now actually included, allowing users to build the documentation again.~
@malfet clarified offline that there is a problem with the docs workflows because they use a cache with a key that includes the hash of the requirements document in the `.ci` folder, which now no longer changes when the requirements change. Hence, a different solution is needed~, though for now the problem remains~.
The solution in this PR is simply to copy the actual document to replace the symlink just prior to creating the source distribution. This way, a single document needs to be maintained, git checkouts remain as they are, and the source distributions contain the before-missing document.
A better solution may be implemented at a later stage with a better build system.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157811
Approved by: https://github.com/atalman
Fixes #156412
For torch.bmm using the CPP generated template code, when the same input is used as both the first and second operand, the generated code simplifies the call so it only passes one input instead of two. However, if the weights are being repacked and saved for more efficient data-loading patterns, then we need to save both inputs instead of just one. This PR fixes that issue.
## Test code:
```python
import torch
@torch.compile(mode="max-autotune")
def my_function(x, y):
    return torch.bmm(x, x)
# Test
x = torch.randn(2, 3, 3)
y = torch.randn(2, 3, 3)
result = my_function(x, y)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160087
Approved by: https://github.com/guangyey, https://github.com/jansel
As title. This is a follow-up of the previous patch, with the goal of
supporting a new pattern that showed up in ComfyUI:
644b23ac0b/comfy/ops.py (L44)
Effectively, the semantics of calling a function decorated with a
context manager is:
```python
@ctx_manager(args)
def f(x):
    ...
f(x)
# ----->
with ctx_manager(args):
    f.__wrapped__(x)
```
Yes, a fresh context manager instance per invocation; see the CPython source code:
https://github.com/python/cpython/blob/3.12/Lib/contextlib.py#L119-L122
So Dynamo already
1. knows how to handle the `with ctx_manager(args)` syntax, and has
special handling for a few torch native context managers, like
`sdpa_kernel` in this patch.
2. can trace through a good chunk (at least the ones that matter in this
case) of contextlib.
This patch just lets Dynamo trace a bit more into contextlib, and then
keep the torch-native special cases by moving their handling a bit down
the stack, so that no additional logic is introduced -- it's only
refactored.
This also allows us to get rid of some `_sdpa_kernel_variadic` special
handling, since now we will trace through its code, and it boils down to
`sdpa_kernel` anyways.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160703
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
ghstack dependencies: #160684
This patch fixes 2 issues, illustrated by the test cases added:
1. an error when using `sdpa_kernel(backends=..., set_priority=...)`, due to an
internal assert that was not updated after #147768.
2. forgetting to convert the `set_priority` VariableTracker back to a
python constant so that its value is properly used by `sdpa_kernel`,
also from #147768.
I ran into (1) because ComfyUI had a recent update that actually uses
this pattern
644b23ac0b/comfy/ops.py (L44),
and then noticed (2), and fixed it conveniently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160684
Approved by: https://github.com/mlazos
Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module.
How to use:
```python
exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module()
compiled_gm = torch._inductor.aot_compile(
    exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1}
)
assert torch.allclose(model(*inp), compiled_gm(*inp))
```
The motivation behind this is that backends like ExecuTorch/MTIA would like to use inductor's optimization technologies, but might have their own graph lowering pipelines so they might not want to use AOTI (which generates an so).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765
Approved by: https://github.com/jansel
This PR rechecks the autotune cache on Precompile.serialize(), allowing us to ahead of time save autotune results for statically compiled triton kernels, so that warm start does not need to check the autotune cache.
It has a few extra changes to make this work:
### Storing source code in TritonBundler
- We now store the source_code for statically compiled triton kernels instead of the hash of the source code in TritonBundler, so that we can easily access their source code when rechecking the autotune cache on PrecompileContext.serialize. To make sure that this is not a huge space concern, I ran the entire hugging face benchmark on training. The total space of `/tmp/torchinductor_jjwu/fxgraph` before my change was 1185004 KB (1.18 GB). After my change, this increased to 1207312 KB (1.2 GB), for an increased storage cost of ~1.8%, which seems safe.
- We now return early from recheck_autotune_cache if the number of triton kernels being compiled is 1, since there's no reason to check the cache at all in those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158656
Approved by: https://github.com/zhxchen17
Porting torchaudio to use the stable API requires the `is_cuda` and `dtype` functions. It would be more convenient if these were methods of the stable tensor class rather than utilities one needs to call from the C API. This PR adds them as methods, mirroring how `is_cuda` and `get_device` are already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160212
Approved by: https://github.com/janeyx99
This PR replaces `guard_serialization_mode` with `save_guards`. All cases where we care about whether or not we're *loading* guards can be inferred automatically from the existing inputs.
The only case that's special here is whether or not to check guards. We don't want to check guards on guard load in CheckFnManager, because these guards have already been checked on save. Therefore, we put the setting in OutputGraphGuardsState, so that when we save, we bypass the guards check.
Because of this change, it is *technically* possible to do a load and a save in the *same* CheckFunctionManager.__init__() by passing all the necessary parts, and also passing `save_guards=True`. This should just work out of the box, but so far no callsites need it, so not super important.
Next up, we'll work on removing save_guards from GuardBuilder, and putting it into its own phase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160531
Approved by: https://github.com/zhxchen17
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.
This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.
- Validation (**ROCm 7.0** + **gfx950** required):
`111 relevant tests passing.`
> PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Co-author: @jagadish-amd — Thank you for the efforts leading validation on gfx950 with ROCm 7.0.
-----------------------------------
This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.
### Support for new scalar types and scaling methods:
* [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934)
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364)
### Updates to scalar type mappings:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.
### Enhancements to testing(@jagadish-amd):
* [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356)
These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
Fixes https://github.com/pytorch/pytorch/issues/159995
Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernel list is within its own subgraph, the indices referencing the operators get messed up because each subgraph starts counting from 0.
So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds the model to an exception list for small tensors in `torch/_dynamo/utils.py`; the exception list already includes two models. For this model, the multiplier factor is increased to 10.0 instead of the default of 3.
In the PR comment below, I include a short analysis of the numerics.
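A hedged sketch of the exception-list idea; the names below are illustrative, not the actual `torch/_dynamo/utils.py` code:
```python
# Hypothetical per-model tolerance multipliers for accuracy checks.
HIGHER_TOLERANCE_MODELS = {
    "GPT2ForSequenceClassification": 10.0,  # numerically small bfloat16 mismatch on ROCm
}

def tolerance_multiplier(model_name: str, default: float = 3.0) -> float:
    # Fall back to the default multiplier for models not on the exception list.
    return HIGHER_TOLERANCE_MODELS.get(model_name, default)
```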
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
Summary:
Currently, Linear in FP32 dynamic mode (batch_size has free symbols) does not support weight prepacking, since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement.
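A hedged sketch of what FP32 dynamic-batch Linear compilation looks like from the user side; the oneDNN weight prepacking itself happens inside inductor and is not visible here:
```python
import torch

model = torch.nn.Linear(128, 64).eval()
compiled = torch.compile(model)

with torch.no_grad():
    x = torch.randn(16, 128)
    torch._dynamo.mark_dynamic(x, 0)  # make batch_size a free symbol
    compiled(x)
    compiled(torch.randn(32, 128))    # different batch size, same dynamic graph
```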
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
Purely a refactor: improve typing and get rid of some type errors. Mark certain fields as nonnull, since in general they are not empty.
The goal of this stack of PRs is to move the save/load logic of guard serialization into separate, flat phases, instead of being embedded in guard creation. This way, we can put a try/catch around it and fail safely if certain guards are not serializable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160530
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.
Testing:
- Add test to verify structure and contents of the tlparse artifact
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from Triton 3.4 specifically to any variant that has support for the TMA APIs.
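A minimal sketch of the feature-detection idea (the helper name is hypothetical; the actual inductor check lives elsewhere):
```python
def has_tma_descriptor_api() -> bool:
    # Detect the TMA tensor-descriptor API by feature rather than by
    # pinning on Triton 3.4 specifically.
    try:
        import triton.language as tl
    except ImportError:
        return False
    return hasattr(tl, "make_tensor_descriptor")
```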
Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`
Rollback Plan:
Differential Revision: D80348643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
Changes:
(1) Replace UserDefinedSetVariable by UserDefinedObjectVariable in all binop calls
Test plan:
(1) The three tests from CPython `test_collections.py` ensure that Dynamo can trace through a dunder method (e.g. `__add__`, `__ixor__`, etc.) defined in a user-defined class; see the sketch below.
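A hedged illustration of the kind of user-defined dunder dispatch this covers (not one of the CPython tests themselves):
```python
import torch

class Multiset:
    def __init__(self, items):
        self.items = set(items)

    def __ixor__(self, other):
        # user-defined in-place dunder that Dynamo dispatches to while tracing
        self.items ^= other.items
        return self

def f(x, a, b):
    a ^= b                      # resolves to Multiset.__ixor__
    return x + len(a.items)

compiled = torch.compile(f)
print(compiled(torch.ones(3), Multiset({1, 2}), Multiset({2, 3})))
```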
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159865
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.
Testing:
- Add test to verify structure and contents of the tlparse artifact
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
ghstack dependencies: #160260
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly, by calling is_contiguous_or_false() and not is_contiguous() explicitly when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function computing contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When people call is_contiguous, we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
One path that was not handled well was this one:
```cpp
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }
  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format).
This went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to do is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is doing sym_is_contiguous.
So I had to define sym_is_contiguous for the pyinterpreter, and then override it for nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
My proposal here is to use GitHub Dependabot to make sure that the `transformers` version used in CI is always up-to-date. To achieve this, this PR does 2 things:
1. Pin the `transformers` version across all CI jobs to only one place at `.ci/docker/ci_commit_pins/huggingface.txt`. This file is now a regular pip requirements file instead of a pinned-commit text file. There isn't any need to pin `transformers` to a specific commit, and the file already refers to a stable version `v4.54.0`.
2. Create `.github/dependabot.yml` to configure the bot to update `transformers` automatically when there is a new version. Those labels will ensure that the right reviewers from torch.compile and Dev Infra are notified. I'm not sure how to test this out in a PR, but it feels OK to land and test this in main. If this works, we should see a PR to update `v4.54.0` to the current latest `v4.55.0`.
### Reference
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160635
Approved by: https://github.com/ZainRizvi
Summary:
as title
This is requested by the zoomer team so they can add stack trace information to profiler result.
Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r stack_traces
```
Rollback Plan:
Differential Revision: D80050233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160779
Approved by: https://github.com/angelayi
This is a similar change to https://github.com/pytorch/pytorch/pull/153986, this time adding flags to the hipcc command under `cpp_extension.py`.
The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these:
```
In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21:
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes]
18 | struct TORCH_API _sparse_softmax_int {
| ^~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API'
100 | #define TORCH_API C10_IMPORT
| ^~~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT'
53 | #define C10_IMPORT __declspec(dllimport)
| ^~~~~~~~~
```
The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.
See also this downstream issue where these changes were tested: https://github.com/ROCm/TheRock/issues/910.
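For reference, a hedged sketch of how one could pass the same flags explicitly when building an extension; on ROCm builds, `cpp_extension` applies the `nvcc` entry to the hipcc invocation (names below are illustrative):
```python
from torch.utils.cpp_extension import CUDAExtension

ext = CUDAExtension(
    name="my_rocm_ext",            # hypothetical extension name
    sources=["my_rocm_ext.hip"],   # hypothetical source file
    extra_compile_args={
        "cxx": [],
        "nvcc": ["-fms-extensions", "-Wno-ignored-attributes"],
    },
)
```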
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159790
Approved by: https://github.com/jeffdaily
Opt-in for now, but basically uses the variable-sequence length/ragged path for the common case of BSHD layout to avoid recompiling for different sequence lengths.
Built on top of #149282
Tested using a primitive fuzzer, seems at least as stable as default path (with recompilation) on B200 (50000+ cases tested without any failures)
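A hedged usage sketch of the cuDNN backend with BSHD-laid-out inputs; the exact opt-in switch for the ragged/varlen path is not named here, so it is omitted:
```python
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

# BSHD memory layout, viewed as (B, H, S, D) for SDPA.
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.bfloat16).transpose(1, 2)
k = torch.randn(2, 256, 8, 64, device="cuda", dtype=torch.bfloat16).transpose(1, 2)
v = torch.randn(2, 256, 8, 64, device="cuda", dtype=torch.bfloat16).transpose(1, 2)

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```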
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155958
Approved by: https://github.com/drisspg
Summary:
Previously our implementation for RecordFunction injected ATen into
codegen, which breaks the ABI contract for AOTInductor.
c10::IValue is added to call the full record function. The extension of
more profiling info will come in later PRs.
Test Plan:
Included in commit.
Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
Summary: There can be excessive stack trace output in TORCH_LOGS="+inductor" when a single line of code corresponds to many post grad nodes, e.g. `self.multihead_attn(x, x, x)`. In that case, we'll see the same stack trace many times in the IR node, spamming the output log. So we change the field to return a set of stack traces.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80310549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160701
Approved by: https://github.com/angelayi
Fixes https://github.com/pytorch/pytorch/issues/160535
Index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr with Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap Identity.
Test Plan:
buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing
Rollback Plan:
Differential Revision: D76308640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504
Approved by: https://github.com/eellison
We don't want to allow scan's combine_fn to mutate its inputs. The semantic of the mutation can be confusing. For example:
```python
def combine_fn(init, x):
    # returns (next_carry, y); mutating `init` or `x` here is what we disallow
    ...
```
If combine_fn mutates init, only the first iteration mutates init; the rest of the iterations mutate the previous carry, which is an intermediate result. This is kind of a weird semantic, because the only observable mutation is of init, which can be done outside of combine_fn.
If combine_fn mutates x, where x is a slice of scanned inputs (i.e. xs), this pattern is more meaningful but we've not seen any use case yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158864
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965, #158863
We add logging for when an ID_MATCH guard is added at a place where inbuilt_inline_nn_modules would inline it. This is done with the aim of tagging recompiles that could be avoided by setting the inbuilt_inline_nn_modules flag.
It will help us log and track the flag's adoption and potentially quantify the savings in the number of recompiles.
Differential Revision: D80075975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160592
Approved by: https://github.com/anijain2305
# Context
Broader context in #160163.
In order for the _utils_internal version of signpost_event to do proper logging, its parameters argument needs to be json serializable.
# This PR
Convert `NumaOptions` to serializable form before inputting to `signpost_event`.
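A minimal sketch of the conversion idea, assuming `NumaOptions` is a dataclass; the field shown is illustrative only:
```python
import dataclasses
import json

@dataclasses.dataclass
class NumaOptions:
    affinity_mode: str = "node"  # illustrative field, not the real schema

def to_serializable(options: NumaOptions) -> dict:
    # dataclasses.asdict yields plain dicts/lists that json can handle
    return dataclasses.asdict(options)

json.dumps(to_serializable(NumaOptions()))  # now safe to pass to signpost_event
```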
# Test Plan
## Automated
Added tests `$ pytest test/test_numa_binding.py`.
## Manual
See [D80317206](https://www.internalfb.com/diff/D80317206).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160710
Approved by: https://github.com/kiukchung
Test is flaky and sometimes hangs in CI
Here's an example of the failure:
https://github.com/pytorch/pytorch/actions/runs/16946153494/job/48027937663
```
2025-08-13T20:54:00.1223688Z ==================================== RERUNS ====================================
2025-08-13T20:54:00.1224156Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1224682Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1225568Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1226430Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1226988Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1227450Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1227792Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1228145Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1228492Z [ RUN ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1228822Z [ OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1229204Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1229501Z
2025-08-13T20:54:00.1229666Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1230033Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1230355Z [ PASSED ] 1 test.
2025-08-13T20:54:00.1230727Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1231154Z what(): Invalid argument
2025-08-13T20:54:00.1231416Z unknown file:0: C++ failure
2025-08-13T20:54:00.1231788Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1232262Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1232745Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1233199Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1233557Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1233915Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1234247Z [ RUN ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1234590Z [ OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1235020Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1235304Z
2025-08-13T20:54:00.1235431Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1235793Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1236126Z [ PASSED ] 1 test.
2025-08-13T20:54:00.1236481Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1236906Z what(): Invalid argument
2025-08-13T20:54:00.1237287Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1237800Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1238686Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1239551Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1240048Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1240495Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1240848Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1241199Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1241542Z [ RUN ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1241871Z [ OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1242249Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1242503Z
2025-08-13T20:54:00.1242641Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1242993Z [==========] 1 test from 1 test suite ran. (19 ms total)
2025-08-13T20:54:00.1243329Z [ PASSED ] 1 test.
2025-08-13T20:54:00.1243697Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1244113Z what(): Invalid argument
2025-08-13T20:54:00.1244392Z unknown file:0: C++ failure
2025-08-13T20:54:00.1244759Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1245235Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1283768Z ============== 1 failed, 568 passed, 2 rerun in 115.57s (0:01:55) ==============
```
Here's an example of the hang:
https://github.com/pytorch/pytorch/actions/runs/16942186826/job/48015238944
Logs aren't super helpful other than stating that it took a long time. Usually this file takes <2min to run
```
2025-08-13T18:43:24.6586481Z [gw0] [ 97%] PASSED [1.4119s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/8
2025-08-13T18:43:24.6587278Z [gw1] [ 97%] PASSED [1.4866s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/9 Command took >30min, returning 124
2025-08-13T18:43:24.6587288Z
2025-08-13T18:43:24.6587632Z FINISHED PRINTING LOG FILE of cpp/test_jit 1/1 (test/test-reports/cpp.test_jit_1.1_c259e5a152845991_.log)
2025-08-13T18:43:24.6587639Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160577
Approved by: https://github.com/huydhn
Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](77cc792cd2), includes:
- Ensures that the XPU cache is cleared before creating tensors during the test
- Add unused variable warning
- Fix test_linalg and test_torch issue with bf32_on_and_off updates
- Fix deterministic indexing with broadcast
- Fix dist.gather with noncontiguous tensor
- Improve accuracy of index put deterministic kernel
- Add a generated-file dependency to avoid building before generation
- Optimize embedding bag
Fixes #160661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062
Approved by: https://github.com/EikanWang
Fixes https://github.com/pytorch/pytorch/issues/160689
The current torchao 0.12.0 doesn't work with transformers 4.54.0 and ends up with this error:
```
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/models/albert/modeling_albert.py", line 37, in <module>
from ...modeling_utils import PreTrainedModel
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/modeling_utils.py", line 51, in <module>
from torchao.quantization import Int4WeightOnlyConfig
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/__init__.py", line 41, in <module>
from torchao.quantization import (
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/__init__.py", line 6, in <module>
from .autoquant import (
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/autoquant.py", line 11, in <module>
from torchao.dtypes import (
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/__init__.py", line 1, in <module>
from . import affine_quantized_tensor_ops
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/affine_quantized_tensor_ops.py", line 38, in <module>
from torchao.dtypes.uintx.dyn_int8_act_int4_wei_cpu_layout import (
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/__init__.py", line 7, in <module>
from .dyn_int8_act_int4_wei_cpu_layout import (
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/dyn_int8_act_int4_wei_cpu_layout.py", line 320, in <module>
from ...prototype.inductor.fx_passes import register_da8w4_concat_linear_cpu_pass
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/__init__.py", line 2, in <module>
from .int8_sdpa_fusion import _int8_sdpa_init
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/int8_sdpa_fusion.py", line 22, in <module>
from ..int8_sdpa_lowering import register_int8_sdpa # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/int8_sdpa_lowering.py", line 6, in <module>
from torch._inductor.kernel.flex_attention import construct_strides, maybe_realize
ModuleNotFoundError: No module named 'torch._inductor.kernel.flex_attention'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160724
Approved by: https://github.com/malfet
Summary: as title. We've received requests from various parties who are interested in turning on provenance tracking by default. In this PR, we prepare to turn on, by default, the part of provenance tracking that doesn't have much overhead.
- Change `provenance_tracking` config to `provenance_tracking_level`
- turn on the following provenance tracking by default when `basic_provenance_tracking=True`:
- `set_kernel_post_grad_provenance_tracing` for kernels, this add mapping between triton kernels and post_grad nodes
- `dump_inductor_provenance_info` if we're dumping tlparse log
- `get_graph_provenance_json` and dump `create_mapping_pre_post_grad_nodes`. This creates a mapping between pre_grad and post_grad nodes. Since we're not turning on the provenance tracking in GraphTransformObserver by default, the mapping here may be incomplete/limited.
- add stack trace from post grad nodes to inductor IR nodes
- add exception swallowing for all functions above
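A hedged usage sketch of the renamed config described in the list above; the exact level values beyond the rename are an assumption here:
```python
import torch._inductor.config as inductor_config

# Formerly `provenance_tracking`; a non-zero level enables the
# low-overhead subset of provenance tracking described above.
inductor_config.provenance_tracking_level = 1
```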
Test Plan:
CI
Rollback Plan:
Differential Revision: D80031559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160383
Approved by: https://github.com/angelayi
With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1.
This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced.
- Thor: 10.1 --> 11.0
- Spark: 12.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156176
Approved by: https://github.com/ezyang
Summary:
An issue I noticed while fixing tests for TMA store: this triton.language.make_tensor_descriptor call hardcodes the shape information as the stride, which is not necessarily correct.
In particular, it's legal to have a stride bigger than the shape (e.g. padded to a size). A good example of this would be allocating a tensor that is always padded to a multiple of 16 so that TMA is legal.
This is redo of https://github.com/pytorch/pytorch/pull/160493 because I broke this accidentally trying to land internally first instead of merging through Github directly.
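A small illustration of why stride must not be assumed equal to shape (plain PyTorch, not the template code itself):
```python
import torch

# A row-padded view: shape (128, 10) but stride (16, 1), so stride[0] > shape[1].
x = torch.empty(128, 16)[:, :10]
assert x.shape == (128, 10)
assert x.stride() == (16, 1)  # the descriptor must use the real strides, not the shape
```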
Test Plan:
Tested with `buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.nvcc_arch=h100 caffe2/test/inductor:max_autotune 2>&1 | tee ~/test_logs.log` and confirmed all max autotune tests passed.
Rollback Plan:
Differential Revision: D80224578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160614
Approved by: https://github.com/eellison
Summary:
(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)
We observed a ~10us PT2-Triton launch overhead regression after the Triton pin update.
Before Triton pin-update:
{F1980557238}
After Triton pin-update:
{F1980557240}
The root cause is that https://github.com/pytorch/pytorch/pull/145051 added `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.
The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), so there is no need to pass constexprs to the generated launcher code.
The new launcher code needs to work on three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor
Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0
Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs
1.893x
```
```
$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime
------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ----------------------------------------
0 0.00760921 1.80298 0.623282 5.25024 0.203722
19 0.00799885 4.78223 1.00226 5.8213 0.239084
average 0.00780403 3.29261 0.812769 5.53577 0.221403
```
After:
```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime
------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ----------------------------------------
0 0.00747067 1.92589 0.726509 4.35459 0.204205
19 0.00747823 7.36852 1.26241 6.28208 0.239278
average 0.00747445 4.6472 0.994459 5.31834 0.221741
```
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs
1.985x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel, https://github.com/mlazos
Co-authored-by: Xu Zhao <xzhao9@meta.com>
Fixes #120648
During issue scrubbing I could not repro these failing tests, so I am re-enabling them to close out the issue.
### Test
Original repro command:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_openmp.py -v -k test_one_thread
```
Now results in
```
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0 -- /home/lucaskabela/.conda/envs/pytorch-3.12/bin/python3.12
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/lucaskabela/pytorch
configfile: pytest.ini
plugins: hypothesis-6.138.0
collected 2 items / 1 deselected / 1 selected
Running 1 items in this shard
test/test_openmp.py::TestOpenMP_ParallelFor::test_one_thread PASSED [3.6874s] [100%]
===================================================== 1 passed, 1 deselected in 6.07s =====================================================
```
And:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_openmp.py TestOpenMP_ParallelFor.test_one_thread
```
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_sort_and_select.py TestSortAndSelectCPU.test_sort_overflow_cpu_int16
```
Both result in:
```
.
----------------------------------------------------------------------
Ran 1 test in 0.003s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160562
Approved by: https://github.com/zou3519
Fixes #160534
Updates the warning in torch.utils.checkpoint to state that starting in PyTorch 2.9, calling checkpoint without explicitly passing use_reentrant will raise an exception. Follows the guidance from the issue discussion.
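A small example of the explicit argument the warning asks for:
```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x) * 2

x = torch.randn(4, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)  # pass use_reentrant explicitly
out.sum().backward()
```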
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160643
Approved by: https://github.com/soulitzer
This PR implements a post_grad pass to fuse the activation(add + mm) pattern into `_addmm_activation`.
This was previously done similarly in #106912 but was reverted for performance reasons; it was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets inductor handle the fusion.
However, since then the cuBLAS team has made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.
perf dash board
<img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" />
Relu works with both training and inference, but gelu only works in inference mode due to a fundamental limitation: gelu's derivative depends on its input while relu's doesn't. I don't think this is fixable with the current addmm_activation API.
Graph module before and after this pass
Relu(addmm)
```
graph():
%primals_1 : [num_users=1] = placeholder[target=primals_1]
%primals_2 : [num_users=2] = placeholder[target=primals_2]
%primals_3 : [num_users=2] = placeholder[target=primals_3]
%addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
%relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
%le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
%permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
return (relu, primals_2, le, permute_1)
graph():
%primals_1 : [num_users=1] = placeholder[target=primals_1]
%primals_2 : [num_users=2] = placeholder[target=primals_2]
%primals_3 : [num_users=2] = placeholder[target=primals_3]
%_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
%le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
%permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
return (_addmm_activation_default, primals_2, le, permute_1)
```
Gelu (addmm)
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
%mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
%mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
%mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
%mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
%tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
%add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
%mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
return (mul_5,)
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
return (_addmm_activation_default,)
```
Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile ran with dynamic = False and max_autotune
H100
```
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0107 ms
Average Time per Iteration (torch compile): 0.0296 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0262 ms
Average Time per Iteration (torch compile): 0.0327 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.1763 ms
Average Time per Iteration (torch compile): 0.2457 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 1.5280 ms
Average Time per Iteration (torch compile): 1.9437 ms
```
A100
```
############################################################
Testing with dtype: float16
############################################################
============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.0313 ms
Average Time per Iteration (torch compile): 0.0643 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.1149 ms
Average Time per Iteration (torch compile): 0.1255 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.6297 ms
Average Time per Iteration (torch compile): 0.7547 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas): 4.3821 ms
Average Time per Iteration (torch compile): 5.0740 ms
```
Script
```py
import torch

torch.manual_seed(0)
warmup, numrun = 10, 100
sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]
device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1]
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")
    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")
        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        C = torch.randn(M, device=device, dtype=dtype)

        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False,
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            }
        )

        for _ in range(warmup): func1()
        torch.cuda.synchronize(device=device)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun
        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        for _ in range(warmup): func2_compiled()
        torch.cuda.synchronize(device=device)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun
        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158137
Approved by: https://github.com/eellison
This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`. If they're part of an application that bundles the Python runtime, then they should use the bundled runtime to keep their view of the world consistent.
There are more `sys.executable` subprocesses in torch/ but it seems like they're fine.
Previous PR at https://github.com/pytorch/pytorch/pull/159382, but was reverted because it caused macOS jobs on GitHub to timeout. What was happening was inductor subprocesses were scheduling C++ compilation tasks that were failing to find the Python.h header. This was because they were running in venvs and now trying to find the CPython headers inside the venv, where the headers do not exist. This PR gates the new behavior to internal builds only.
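A minimal sketch of the pattern: launch the child with the same interpreter (and hence the same bundled runtime) as the parent:
```python
import subprocess
import sys

# Use the parent's interpreter instead of a bare "python" found on PATH.
subprocess.run(
    [sys.executable, "-c", "import torch; print(torch.__version__)"],
    check=True,
)
```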
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160008
Approved by: https://github.com/aorenste
Allow torch.hub.load with unauthorized GITHUB_TOKEN
`torch.hub.load` fails if a `GITHUB_TOKEN` with few permissions is set, as can be seen in the following example. Make sure that the model has not been cached before, for example with `rm ~/.cache/torch`. If the model has been downloaded already, it will not be downloaded again and the authorization error will not occur.
```python
export GITHUB_TOKEN=""
python
>>> import torch
>>> torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 567, in load
repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 231, in _get_cache_or_reload
_validate_not_a_forked_repo(repo_owner, repo_name, ref)
File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 191, in _validate_not_a_forked_repo
response = json.loads(_read_url(Request(url, headers=headers)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 174, in _read_url
with urlopen(url) as r:
^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "~/miniconda3/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
```
The cause of the error is that the function `_validate_not_a_forked_repo` in `hub.py` always uses `GITHUB_TOKEN` for authorization, even when downloading does not require authorization.
0ba09a6d34/torch/hub.py (L194)
This fix simply retries the download without the token in case of a failure.
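A minimal sketch of the retry-without-token idea (not the exact `hub.py` code):
```python
import json
import os
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def read_github_api(url: str) -> dict:
    headers = {"Accept": "application/vnd.github.v3+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"token {token}"
    try:
        with urlopen(Request(url, headers=headers)) as r:
            return json.loads(r.read().decode())
    except HTTPError:
        # The token may lack permissions for this public endpoint; retry anonymously.
        anon = {"Accept": "application/vnd.github.v3+json"}
        with urlopen(Request(url, headers=anon)) as r:
            return json.loads(r.read().decode())
```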
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159896
Approved by: https://github.com/albanD
**Summary**
This PR enables in-place op `aten.squeeze_.dim` on DTensor with a change to
DTensor dispatch logic: when processing in-place operator, we should assign
`output_sharding.output_spec` back to the first argument. This is because
the in-place op_call on `arg._local_tensor` could also shift the tensor meta.
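A hedged usage sketch of the op this enables; it assumes an initialized process group (e.g. run under torchrun):
```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

dt = distribute_tensor(torch.randn(4, 1, 8), mesh, [Replicate()])
dt.squeeze_(1)                      # in-place squeeze on a DTensor
assert dt.shape == (4, 8)
```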
**Test**
`pytest test/distributed/tensor/test_view_ops.py -s -k test_squeeze_`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159532
Approved by: https://github.com/zpcore
Update schedule tests to use `world_size=4`, changes needed:
- Move some tests that require world_size=2 to new class
- Move helper methods from class level to function level
- Update some initialization to pass the assert, since gradients were super small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160559
Approved by: https://github.com/wconstab
ghstack dependencies: #159591, #160558
# Description
`.coalesce` cannot handle large inputs on ROCm due to the maximal grid size limit.
This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation.
Confirmed the new approach can handle large inputs. Correctness needs validation.
# Testing Command
`python torch_spmv.py 22500000 272500000`
## Script `torch_spmv.py`
``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')
    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)
    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)
    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()
    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)
    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)
    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jeffdaily
# Summary
MTIA's torch.compile tests were broken by D80037015. (For details, see internal task T234563969.) The root cause was that `has_triton` can change state after we call `torch.mtia.init()`, but it was used in a way that fixes Inductor's behavior at import time. (Note that `has_triton` is cached, and there's no opportunity to call `torch.mtia.init()` prior to `import torch`.)
To fix this, we use `try: import triton` as opposed to `has_triton()` at the module level.
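A minimal sketch of the change, assuming a simplified module-level flag (the actual code paths differ):
```python
# Before (simplified): the cached helper bakes in the answer at import time,
# before torch.mtia.init() has a chance to run.
# from torch.utils._triton import has_triton
# HAS_TRITON = has_triton()

# After: only probe whether the triton package itself can be imported.
try:
    import triton  # noqa: F401

    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False
```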
# Test Plan
See the internal diff. As a follow-up, we will add appropriate unit tests and/or CI hints so this type of issue can be caught at PR/diff time.
Differential Revision: D80228000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160604
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Revert #156651 to allow using the cutlass PIP package which is easier for users than the Git checkout or similar method.
Also fix a bug where the PIP cutlass path wouldn't be available to subprocesses spawned during benchmarking for algorithm selection. Looks like the "spawn" method does not inherit the (potentially) already set up `config.cuda.cutlass_dir` so in the subprocess the include paths will still be set to `"../third_party/cutlass/"` leading to compilation failure due to missing headers.
Ensure `try_import_cutlass` is called at that point; thanks to caching this is a no-op in most cases, so it doesn't hurt.
Change the logic to return `None` when CUTLASS isn't available, and to return a more useful value for the include paths, namely an empty list. This is in line with other Inductor code, which disables the CUTLASS backend when `try_import_cutlass` returns False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160180
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
Currently SDPA on XPU uses its own `priority_order` instead of the one from the global context, so it does not support `with sdpa_kernel(order, set_priority=True)`.
This PR enables this feature. To make the default `priority_order` from the global context work for XPU, I also move the MATH backend to the lowest priority; otherwise `cudnn attention` and `overrideable attention` would never be selected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464
Approved by: https://github.com/guangyey, https://github.com/drisspg
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
# set up vllm build logic
- dockerfile: please note that the Dockerfile introduced here is only temporary; once we migrate this file to vllm, we will fetch it directly from there
- VllmBuildRunner:
  - implement logic to prepare and run the vllm build with the Dockerfile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160089
Approved by: https://github.com/huydhn
ghstack dependencies: #160043
Summary:
In HF model rwkv, we have parameter mutation under inference mode which should be safe. This PR does multiple things to make sure it works:
1. We execute global autograd mutation while tracing so that we can actually trace through parameter inplace mutation
2. Add support for parameter mutation under inference mode in AOTAutograd
3. Add support for parameter mutation under inference mode in export.
Test Plan:
test
Rollback Plan:
Differential Revision: D79460136
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159661
Approved by: https://github.com/ydwu4
Existing documentation for torch.finfo().eps is as below:
| eps | float | The smallest representable number such that ``1.0 + eps != 1.0``. |
Proposed documentation for torch.finfo().eps is as below:
| eps | float | The difference between 1.0 and the next smallest representable float larger than 1.0. |
Fixes #160397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502
Approved by: https://github.com/ngimel
Using good old IOKit to get `gpu-core-count` property from device implementing `AGXAccelerator` service
Expose it as `torch.backends.mps.get_core_count()` and make it accessible to Inductor via `MpsInterface`.
Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType|head -n10`
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:
Apple M1 Pro:
Chipset Model: Apple M1 Pro
Type: GPU
Bus: Built-In
Total Number of Cores: 16
Vendor: Apple (0x106b)
Metal Support: Metal 3
```
This would significantly improve occupancy for torch.compile generated kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
Summary:
Fix unit test test_equivalent_template_code
https://github.com/pytorch/pytorch/pull/159920 treats ReinterpretView as a non-realized node when searching for the FX origin nodes of a fused Triton kernel. In test_equivalent_template_code there is a transpose node (which is a ReinterpretView) before the matmul; it was not part of the FX graph segment before PR #159920. FX origin nodes are used to define the name of the Triton kernel, which is why test_equivalent_template_code failed with PR #159920: it uses a hard-coded Triton kernel name to check the result. The fix is to update the Triton kernel name in the unit test.
Test Plan:
buck2 run mode/opt caffe2/test/inductor:benchmark_fusion -- caffe2.test.inductor.test_benchmark_fusion.BenchmarkMultiTemplateFusionCudaTest
Rollback Plan:
Differential Revision: D80101711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160432
Approved by: https://github.com/clee2000
Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected.
Notes:
- Used allgather instead of gather since it's simpler to do this SPMD-style rather than add conditional behavior, though I could be convinced we only want to log on rank 0.
Usage:
`check_rng_sync(generator, group)`
Prints something like this:
(cuda):
```
[rank0]:E0808 ] Generator desync detected:
[rank0]:E0808 ] Ranks (Seed, Offset) values
[rank0]:E0808 ] ------- -----------------------
[rank0]:E0808 ] 0 (456, 0)
[rank0]:E0808 ] 1 (123, 4)
[rank0]:E0808 ] 2-3 (123, 0)
```
(cpu):
```
[rank2]:E0810 ] Generator desync detected:
[rank2]:E0810 ] Ranks Generator State Hash values
[rank2]:E0810 ] ------- -----------------------------
[rank2]:E0810 ] 0 7633364531954955665
[rank2]:E0810 ] 1 8807615394212033278
[rank2]:E0810 ] 2-3 -6150027303226666531
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283
Approved by: https://github.com/ezyang
Summary: When an IR node is a derived class, post_init is called once for each super().__init__() call. To avoid the duplicated work, we make the stack trace computation happen lazily.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80137870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160487
Approved by: https://github.com/angelayi
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key
- Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered).
- `SafeKernelFunction` can be called via `callBoxed`, the validity of the token will be checked before this happens
- `SafeKernelFunction` is pybinded, and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel`
Related to https://github.com/pytorch/pytorch/issues/155330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393
Approved by: https://github.com/albanD
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128
**Reproducer:**
```
import time
import torch
shapes = [
(5079670, 128)
]
dims = [
(1)
]
for i, shape in enumerate(shapes):
x = torch.randn(shape, device='cuda', dtype=torch.float)
for _ in range(10):
w = torch.sum(x, dims[i])
torch.cuda.synchronize()
print(w.size())
start_time = time.time()
for _ in range(50):
_ = torch.sum(x, dims[i])
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/50
print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```
**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us
**After (MI300X)**
Avg time for shape (5079670, 128): 1008.59 us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
`_transform_cuda_paths` intentionally includes the CUDA stubs folder.
However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with
> CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library"
This results in e.g. non-descriptive errors like
```
cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67 cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096
terminate called after throwing an instance of 'cutlass::cuda_exception'
what(): std::exception
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160179
Approved by: https://github.com/jansel
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off, for now, is that dedupe and re-sharding are not supported; support for these will be introduced soon.
Differential Revision: D70112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
get_free_symbol_uses is used to know what unbacked symbols are used by a given node.
Not having get_free_symbol_uses defined correctly leads to:
- elimination of some nodes because no users are detected (see the added unit test)
- an incorrect topological sort
This fixes get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the layout of the underlying view op's base and detect the symbols in it, because those symbols are used when we codegen the ComputedBuffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
Before, we would topologically sort each region individually. This works well except when some nodes have no arguments, in which case their order may change. To rectify this, we sort the first region as the reference region and use that sort order to sort the remaining regions (see the sketch below).
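A toy sketch of the idea (hypothetical `key` function; regions are assumed to be aligned index-wise, i.e. `regions[r][i]` corresponds to `regions[0][i]`):
```python
def sort_duplicate_regions(regions, key):
    # Sort the first region as the reference, then apply the same permutation
    # to every other region so corresponding nodes keep a consistent order.
    perm = sorted(range(len(regions[0])), key=lambda i: key(regions[0][i]))
    return [[region[i] for i in perm] for region in regions]

# Example: nodes identified by (name, topo_rank) pairs.
regions = [[("b", 2), ("a", 1)], [("b2", 2), ("a2", 1)]]
print(sort_duplicate_regions(regions, key=lambda node: node[1]))
# [[('a', 1), ('b', 2)], [('a2', 1), ('b2', 2)]]
```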
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158814
Approved by: https://github.com/williamwen42
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- Unify some common code under torch/testing/_internal for multiple backend, for example:
- requires_nccl_version
- _dynamo_dist_per_rank_init
- DynamoDistributedSingleProcTestCase
- DistTestCases
- FSDPTestMultiThread
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
# Description
Fixes #114850: we will port dynamo tests to Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style:
# Changes
1. Get device type from get_devtype() method.
2. Replace the requires_cuda_and_triton with requires_gpu.
3. Add HAS_XPU_AND_TRITON into the scope.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160309
Approved by: https://github.com/guangyey, https://github.com/ezyang
Context:
When writing a custom `torch.compile` backend, I quite frequently (ab)use `trace_structured_artifact` because I'm too lazy to customize tlparse (ref: 6d8b13c867).
I recently noticed that some of the artifacts I want to store are generated where a CompileID cannot be correlated, and the `tlparse` HTML says
> Sometimes, logs are made without a compile id. This makes it difficult to correlate related logs. This stack trie shows all places where log entries occurred without compile context; to fix, look an appropriate place in the stack where compile id should have been specified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160440
Approved by: https://github.com/ezyang
Not all storage systems support writing at random offsets. This PR changes the writes of the consolidation script to write each tensor to a buffer and then write out the buffer, going sequentially through every tensor in the output file. This also helps when the sharded files weren't sharded only along the row-wise dimension: small writes are expensive, and we previously issued a write for every chunk that formed the largest run of contiguous bytes in the final tensor, which can be very few bytes for col-wise sharding. Now the full tensor is assembled before the write, reducing the number of small writes. A rough sketch of the pattern follows.
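Illustrative only; the hypothetical `chunks` are `(offset_within_tensor, bytes)` pairs produced from the sharded files:
```python
import io

def write_tensor_sequentially(out_stream, chunks):
    # Assemble the full tensor's bytes in memory first...
    buf = io.BytesIO()
    for offset, data in chunks:
        buf.seek(offset)
        buf.write(data)
    # ...then issue a single large, sequential write to storage instead of
    # many small positioned writes (which some storage systems don't support).
    out_stream.write(buf.getvalue())

with open("consolidated.bin", "wb") as f:  # hypothetical output path
    write_tensor_sequentially(f, [(4, b"WORLD"), (0, b"HELL")])
```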
Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
Somebody checked in twice the number of mocks into the archive.
Filter them out by running the following script:
```python
import json
with open("gql_mocks-orig.json") as f:
mocks = json.load(f)
keys = list(mocks.keys())
good_shas = {'a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876',
'157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f',
'4715ed05b382e572135c049664939f22f9b1249bc0c499ae278d655ad8cb598b',
'a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5',
'e5130469b5373479776bfbccade8039ce4741b97873bb3bec4e279fed08602be',
'5dc32efeb8306f03744f6804ef4b500882f2759f7ac17fdc9f123669bfe4805a',
'0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98',
'8b50878b010492fe64005cc4b4ed34ac5f6695ce093f06b0d8d5403b7787c2c0',
'2877b3b1e8630ca4ae797b9d85d5673d25ca8488c01141e11ff55f4a1359fca7'}
for k in keys:
if any(sha in k for sha in good_shas):
continue
del mocks[k]
with open("gql_mocks.json","w") as f:
json.dump(mocks, f, indent=2)
f.write("\n")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160492
Approved by: https://github.com/huydhn
ghstack dependencies: #160490
Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument.
Test Plan:
`buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel`
Rollback Plan:
Reviewed By: davidberard98
Differential Revision: D80069909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160422
Approved by: https://github.com/davidberard98, https://github.com/mlazos
Instead of implicitly creating the NCCL communicator inside memory-pool registration for symmetric memory, we now raise an error, so we only support the eager-init case where the NCCL communicator has already been initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145
Approved by: https://github.com/kwen2501
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'
## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files
## Files Modified
This PR updates files that contained the target text.
Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list
Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners
This should help increase the number of available runners even with the same number of CI nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)
Context:
When testing the cutlass backend and using autotune with subprocesses, I would sometimes see a C++ compilation error (expected) followed by
```
Traceback (most recent call last):
File "/torch/_inductor/autotune_process.py", line 175, in get
result = TuningProcess.recv(self.read_pipe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/autotune_process.py", line 99, in recv
return pickle.load(read_pipe)
^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking claude, it seems
> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:
Adding these seems to help.
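A minimal sketch of the pattern (a stand-in exception class, not the actual Inductor `CppCompileError`): an exception whose `__init__` requires arguments needs `__reduce__` so it can round-trip through pickle across the subprocess boundary.
```python
import pickle

class CompileError(Exception):
    def __init__(self, cmd: list, output: str) -> None:
        super().__init__(f"compile failed: {' '.join(cmd)}\n{output}")
        self.cmd = cmd
        self.output = output

    def __reduce__(self):
        # Recreate the exception with both constructor arguments on unpickle;
        # the default Exception pickling would only pass the formatted message.
        return (type(self), (self.cmd, self.output))

err = pickle.loads(pickle.dumps(CompileError(["clang", "a.cpp"], "no such file")))
assert err.cmd == ["clang", "a.cpp"] and err.output == "no such file"
```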
fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294
Approved by: https://github.com/masnesral
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.
The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.
This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.
Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.
Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386
Approved by: https://github.com/eellison
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.
Fixes internal issue:
```
FAILED (errors=1)
CppCompileError: C++ compile error
Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so
Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```
Reviewed By: clee2000
Differential Revision: D80025417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
In the current implementation of three-dimensional reductions for AMD GPUs, the number of values per thread is unbounded and can end up in the hundreds of thousands for certain tensors, which is of course bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
Reduces collective calls in the forward pass from 2 to 1
In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692
Approved by: https://github.com/tianyu-l
Fixes #152985
In #152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals.
Because the error message doesn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration.
This fix makes that point clearer.
Here's the new error message; the blocked-function information now follows the warning message after a line break to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:
Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
To execute this test, run the following from the base repo dir:
python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935
Approved by: https://github.com/mikaylagawarecki
# Context
This is an extension of #149334.
# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.
Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.
Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran
```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```
and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings
I also ran similar with `str` entrypoints as before just to be sure it's still working.
NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
* Opportunistic fast atomics work better with small sizes, since there is a higher chance of lanes doing atomics on the same address
Co-author: @amd-hhashemi
Reproducer:
```
import time
import torch
x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)
for _ in range(20):
x.index_add_(0, ind, src)
start_time = time.time()
for i in range(100):
x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```
Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us
After:
Avg time for index_add_: 2675.15 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
# Description
`.coalesce` cannot handle large inputs on ROCM due to maximal grid size limit.
This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation.
Confirmed the new approach can handle large inputs. Correctness needs validation.
# Testing Command
`python torch_spmv.py 22500000 272500000`
## Script `torch_spmv.py`
``` python
import torch
import argparse
def parse_args():
parser = argparse.ArgumentParser(
description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
)
parser.add_argument("n", type=int, help="Size of the NxN matrix")
parser.add_argument("nnz", type=int, help="Number of non-zero entries")
return parser.parse_args()
def main():
args = parse_args()
n = args.n
nnz = args.nnz
dtype = torch.float32
device = torch.device('cuda')
# Generate random indices for the sparse matrix in COO format.
torch.manual_seed(42)
rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
indices = torch.stack([rows, cols], dim=0)
# Generate random values.
values = torch.randn(nnz, dtype=torch.float32, device=device)
# Create the sparse COO matrix and move it to the target device.
sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
sparse_matrix = sparse_matrix.coalesce()
# Generate a random dense vector.
dense_vector = torch.randn(n, dtype=torch.float32, device=device)
# Perform sparse matrix - dense vector multiplication.
# Using torch.sparse.mm which expects a 2D tensor for the vector.
result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
# result = torch.mv(sparse_matrix, dense_vector)
# Print the result.
print("Result of the multiplication:")
print(torch.sum(result))
if __name__ == "__main__":
main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
This PR fixes a bug where user-defined Triton kernels hidden behind `triton_op` do not register source-code changes. If a user changes *only* a Triton kernel's source code, Dynamo has not traced into the kernel yet because it is hidden under the custom op.
This means that at AOTAutograd time we don't know the list of Triton kernels defined by custom ops. This initial fix parses the AST of the custom op looking for Triton kernels (a toy sketch is shown below). It won't catch more degenerate cases where the custom op calls other custom ops/functions that then call Triton kernels, so the top-level compiled graph doesn't know about them; to handle that, we'd have to trace through the custom op at Dynamo time.
This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
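A toy sketch of the AST-scanning idea (simplified; the real fix looks specifically for Triton kernels among the collected globals):
```python
import ast
import inspect
import textwrap

def referenced_globals(fn):
    """Collect global objects that a function's source refers to by name."""
    tree = ast.parse(textwrap.dedent(inspect.getsource(fn)))
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return {name: fn.__globals__[name] for name in names if name in fn.__globals__}

# In the actual fix one would keep only entries that are Triton kernels, e.g.
# isinstance(obj, triton.runtime.jit.JITFunction), and include their source in
# the cache key so source-only edits invalidate the compiled graph.
```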
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
Summary: We were still getting a DDE on reshape! I looked deeper and found an issue in `_view_has_unbacked_input`: namely, when the input is of the form [[,,]] it needs to be normalized to [..].
Test Plan:
existing tests.
Rollback Plan:
Differential Revision: D79951119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
Short-term fix for https://github.com/pytorch/pytorch/issues/160333
The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.
The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
This reverts the changes from b367e5f6a6. This will also close https://github.com/pytorch/pytorch/pull/158922.
Since 30387ab2e4, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib.
There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
cmd_obj.run()
File "D:\b\pytorch_main\setup.py", line 853, in run
rocm_dir_path = Path(os.environ["ROCM_DIR"])
~~~~~~~~~~^^^^^^^^^^^^
File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
This adds two changes:
- Isolates pre-push hook dependencies into a dedicated venv, so they no longer affect your system environment
- Lets you manually run the pre-push lintrunner (including with lintrunner -a) by invoking `python scripts/lintrunner.py [-a]` (it's ugly, but better than nothing...for now)
This is a follow up to:
- https://github.com/pytorch/pytorch/pull/158389
## Problem
The current pre-push hook setup installs lintrunner and related dependencies globally, which makes developers nervous about system pollution and can cause version conflicts with existing installations.
Also, if the pre-push lintrunner found errors, you had to hope your normal lintrunner could fix them (which wasn't always the case, e.g. if those errors only manifested in certain python versions)
## Key Changes:
- Isolated Environment: Creates .git/hooks/linter/.venv/ with Python 3.9 (the python used in CI) and an isolated lintrunner installation
- User-Friendly CLI: New python scripts/lintrunner.py wrapper allows developers to run lintrunner (including -a auto-fix) from any environment
- Simplified Architecture: Eliminates pre-commit dependency entirely - uses direct git hooks
File Changes:
- scripts/setup_hooks.py: Rewritten to create isolated uv-managed virtual environment
- scripts/lintrunner.py: New wrapper script with shared hash management logic
- scripts/run_lintrunner.py: Removed (functionality merged into lintrunner.py)
- .pre-commit-config.yaml: Removed (no longer needed)
## Usage:
```
# Setup (run once)
python scripts/setup_hooks.py
# Manual linting (works from any environment)
python scripts/lintrunner.py # Check mode
python scripts/lintrunner.py -a # Auto-fix mode
# Git hooks work automatically
git push # Runs lintrunner in isolated environment
# Need to skip the pre-push hook?
git push --no-verify
```
## Benefits:
- ✅ Zero global dependency installation
- ✅ Per-repository isolation prevents version conflicts
- ✅ Full lintrunner functionality is now accessible
## Implementation Notes:
- Virtual env is kept in a dedicated dir in .git, to keep per-repo mechanics
- lintrunner.py does not need to be invoked from a specific venv. It'll invoke the right venv itself.
A minor bug: It tends to garble the lintrunner output a bit, like the screenshot below shows, but I haven't found a workaround so far and it remains understandable to users:
<img width="241" height="154" alt="image" src="https://github.com/user-attachments/assets/9496f925-8524-4434-8486-dc579442d688" />
## What's next?
Features that could be added:
- Check for lintrunner updates, auto-update if needed
- Depending on dev response, this could be enabled by default for all pytorch/pytorch environments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160048
Approved by: https://github.com/seemethere
This PR fixes errors like the one below:
```
[rank7]: RuntimeError: /tmp/comgr-c3c81b/input/CompileSourceejOPx6:34:8: error: unknown type name 'uint64_t'; did you mean
'__hip_internal::uint64_t'? [rank7]: 34 | if(((uint64_t) t0.data) % (4 * sizeof(half)) != 0) flag_vec4 = false;
```
The following datatypes need to be defined in `torch/csrc/jit/codegen/fuser/cuda/resource_strings.h` for ROCm versions >= 7.0.
```
typedef unsigned char uint8_t;
typedef signed char int8_t;
typedef short int int16_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159996
Approved by: https://github.com/pruthvistony, https://github.com/Skylion007, https://github.com/jeffdaily
See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`.
This fixes configure errors like:
```
CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set):
Syntax error in cmake code at
D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2
when parsing string
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules
Invalid character escape '\p'.
CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile):
Failed to configure test project build system.
```
(note the mixed usage of `\` and `/` in that string)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080
Approved by: https://github.com/jeffdaily
Summary:
A model could have multiple ExportedPrograms:
- for different methods, which can have different weights
- for different delegates, which can also have different weights
For this reason, we make weights per-ExportedProgram.
We also clean up Model and Program. IIUC, Model and Program are not used anywhere, so it's OK to make a BC-breaking change.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79917395
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160220
Approved by: https://github.com/angelayi, https://github.com/dolpm, https://github.com/jingsh
After the change, the error stack trace includes the user-code stack and is collapsed into a single traceback (without the message telling you to scroll up). For example:
```python
class Test(torch.nn.Module):
def forward(self, c, x):
def cond_fn(c, x):
return c > 0 and x.size(0) < 20
def body_fn(c, x):
return c - 1, x.sin()
return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
```
Now gives the following error message:
```python
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1705, in test_while_loop_size_mismatch_tensor_expansion
self._run_test(
~~~~~~~~~~~~~~^
model=WhileLoopModels.SizeMismatchTensorExpansion(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
dynamic=dynamic,
^^^^^^^^^^^^^^^^
)
^
File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1417, in _run_test
result = model(*inputs_with_counters)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1053, in forward
return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 176, in while_loop
return torch.compile(
~~~~~~~~~~~~~~
_while_loop_op_wrapper, backend=backend, fullgraph=True
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)(flat_cond_fn, flat_body_fn, tuple(flat_inputs), tuple())
~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1595, in __call__
result = self._torchdynamo_orig_backend(
frame, cache_entry, self.hooks, frame_state, skip=1
)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1353, in __call__
result = self._inner_convert(
frame, cache_entry, hooks, frame_state, skip=skip + 1
)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 682, in __call__
result = _compile(
frame.f_code,
...<16 lines>...
convert_frame_box=self._box,
)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1172, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_utils_internal.py", line 98, in wrapper_function
return function(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 858, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 897, in _compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1461, in transform_code_object
transformations(instructions, code_options)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 300, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 818, in transform
tracer.run()
~~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3528, in run
super().run()
~~~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
while self.step():
~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
self.dispatch_table[inst.opcode](self, inst)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars)
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 91, in graph_break_as_hard_error
raise exc.with_traceback(sys.exc_info()[2]) from None
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 77, in graph_break_as_hard_error
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1287, in call_function
) = speculate_subgraph(
~~~~~~~~~~~~~~~~~~^
tx,
^^^
...<33 lines>...
supports_aliasing=self.supports_aliasing,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 877, in speculate_subgraph
raise ex
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 718, in speculate_subgraph
output = f.call_function(tx, args, sub_kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
return super().call_function(tx, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
return tracer.inline_call_()
~~~~~~~~~~~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
self.run()
~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
while self.step():
~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
self.dispatch_table[inst.opcode](self, inst)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars)
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
return super().call_function(tx, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
return tracer.inline_call_()
~~~~~~~~~~~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
self.run()
~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
while self.step():
~~~~~~~~~^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
self.dispatch_table[inst.opcode](self, inst)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 830, in inner
unimplemented_v2(
~~~~~~~~~~~~~~~~^
gb_type="Data-dependent branching",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
],
^^
)
^
File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 580, in unimplemented_v2
raise Unsupported(msg)
torch._dynamo.exc.UncapturedHigherOrderOpError: while_loop doesn't work unless it is captured completely with torch.compile. Got Data-dependent branching
Explanation: Detected data-dependent branching (e.g. `if my_tensor.sum() > 0:`). Dynamo does not support tracing dynamic control flow.
Hint: This graph break is fundamental - it is unlikely that Dynamo will ever be able to trace through your code. Consider finding a workaround.
Hint: Use `torch.cond` to express dynamic control flow.
Developer debug context: attempted to jump with TensorVariable()
For more details about this graph break, please visit: https://pytorch-labs.github.io/compile-graph-break-site/gb/gb0170.html
from user code:
File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 167, in _while_loop_op_wrapper
return while_loop_op(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 137, in flat_cond_fn
return cond_fn(*carried, *additional)
File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1047, in cond_fn
return c > 0 and x.size(0) < 20
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
To execute this test, run the following from the base repo dir:
python test/inductor/test_control_flow.py WhileLoopTests.test_while_loop_size_mismatch_tensor_expansion_device_cpu_dynamic_False
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159296
Approved by: https://github.com/zou3519
Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods.
Test Plan:
`torch._C._get_accelerator()`
Rollback Plan:
Differential Revision: D79981115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304
Approved by: https://github.com/nautsimon
In this PR we will port all distributed pipeline test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:
1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get the backend
5. enable XPU for some test paths
6. add TEST_MULTIACCELERATOR in common_utils for all backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Co-authored-by: Daisy Deng <daisy.deng@intel.com>
Summary:
(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)
We observed ~10us PT2-Triton launch overhead regression after pin update.
Before Triton pin-update:
{F1980557238}
After Triton pin-update:
{F1980557240}
The root cause is that https://github.com/pytorch/pytorch/pull/145051 added `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.
The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
Note that the static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), there is no need to pass in constexprs to the generated launcher code.
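A toy illustration of the difference (hypothetical names; not Inductor's actual codegen):
```python
def merge_constexprs(args, constexprs):
    # Stand-in for the per-call reordering work (_get_args_with_constexprs analogue).
    return list(args) + [v for _, v in sorted(constexprs.items())]

# Before: the merge runs on every kernel launch, i.e. on the hot path.
def launch_before(kernel, args, constexprs):
    kernel(*merge_constexprs(args, constexprs))

# After: once autotuning has fixed the constexpr values, bake them into the
# generated launcher source as literals, so each launch just forwards runtime args.
def make_launcher(kernel, constexprs):
    consts = ", ".join(repr(v) for _, v in sorted(constexprs.items()))
    src = f"def launcher(*args):\n    kernel(*args, {consts})\n"
    ns = {"kernel": kernel}
    exec(src, ns)
    return ns["launcher"]

launcher = make_launcher(print, {"BLOCK": 128, "num_warps": 4})
launcher(1, 2)  # prints: 1 2 128 4
```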
The new launcher code needs to work on three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor
Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0
Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs
1.893x
```
```
$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime
------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ----------------------------------------
0 0.00760921 1.80298 0.623282 5.25024 0.203722
19 0.00799885 4.78223 1.00226 5.8213 0.239084
average 0.00780403 3.29261 0.812769 5.53577 0.221403
```
After:
```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime
------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ----------------------------------------
0 0.00747067 1.92589 0.726509 4.35459 0.204205
19 0.00747823 7.36852 1.26241 6.28208 0.239278
average 0.00747445 4.6472 0.994459 5.31834 0.221741
```
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs
1.985x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel
Co-authored-by: Xu Zhao <xzhao9@meta.com>
Summary:
In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known before they are computed in run time, so **allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints** .
So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it only becomes available when its live range begins.
Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its life range.
If a block's earliest available time is later than a pool's life range's start time, we cannot allocate it from the pool.
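A toy sketch of that availability check (hypothetical field names, not the actual memory-planning classes):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LiveRange:
    begin: int
    end: int

@dataclass
class Block:
    live_range: LiveRange
    # Set when the allocation size involves an unbacked symint: the block only
    # becomes allocatable at the start of its live range.
    earliest_available: Optional[int] = None

def can_allocate_from_pool(block: Block, pool_range: LiveRange) -> bool:
    if block.earliest_available is None:
        return True
    # The pool's backing allocation is created at pool_range.begin; a block that
    # only becomes available later cannot be carved out of that pool.
    return block.earliest_available <= pool_range.begin
```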
We also fix a memory leak that's caused by allocating tensor without wrapping it with RAIIAtenTensor.
In python wrapper for JIT inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to wrapper, it just returns the string to alloc. However, in cpp_wrapper, `codegen_alloc_from_pool` actually write to the wrapper. Specifically, it writes the following and returns string `RAIIAtenTensorHandle`.
```
AtenTensorHandle handle_name;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....);
```
This is bug-prone. **If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well**, otherwise you get memory leaks.
We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool.
Test Plan:
```
python test/inductor/test_memory_planning.py
```
Rollback Plan:
Differential Revision: D79603119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839
Approved by: https://github.com/jansel
Summary:
LLVM has a warning `-Wunreachable-code-break` which identifies `break` statements that cannot be reached. These compromise readability, are misleading, and may identify bugs. This diff removes such statements.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan:
Sandcastle
Rollback Plan:
Differential Revision: D79835614
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160257
Approved by: https://github.com/Skylion007
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove option `USE_FAKELOWP` which is not used.
3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
The "Get the PyTorch Source" section is now located before the "Install Dependencies/Common" section, so "... using the “Get the PyTorch Source“ section below" should be "... using the “Get the PyTorch Source“ section above".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160160
Approved by: https://github.com/BoyuanFeng
By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they could be int64, int32, or float.
Unified all MM flavor instantiations through `INSTANTIATE_MM_OPS` and verified that the `addmm` Metal kernel works as expected for floating types as well via
```
PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```
Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
With this PR, we can turn on the Inductor UTs on Windows CPU.
Changes:
1. Turn on Inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the run would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT file `test_cpu_select_algorithm.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
get_free_symbol_uses is used to know what unbacked symbols are used by a given node.
Not having get_free_symbol_uses defined properly leads to:
1. Elimination of some nodes because no users are detected (see the added unit test).
2. Incorrect topological sort.
Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the layout of the underlying view op's base and detect the symbols in it, because those symbols are used when we codegen the ComputedBuffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160134
Approved by: https://github.com/bobrenjc93
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
    if HAS_GPU:
        run_tests()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
Add a test that requires weights to be packaged for torch native.
For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.
After we add weight deduping, we should be able to set this config to False.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
Summary:
Add support for torch._check() in TorchScript jit.script frontend.
* It will be special cased to behave like torch._assert, turned into an if + raise exception.
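For illustration, a minimal sketch of the kind of usage this enables (the function is hypothetical, and it assumes this PR's special-casing of torch._check under scripting):
```python
import torch

@torch.jit.script
def first_row(x: torch.Tensor) -> torch.Tensor:
    # With this change, torch._check is lowered to an if + raise,
    # just like torch._assert.
    torch._check(x.dim() == 2)
    return x[0]

print(first_row(torch.ones(2, 3)))
```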
Test Plan:
Unit tests
Rollback Plan:
Differential Revision: D79744604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159988
Approved by: https://github.com/davidberard98
#132339 changed the parent/child mesh related APIs in _MeshEnv. The UT TestFSDPWithEP.test_e2e still uses the old APIs and fails:
```
File "/home/kanya/pytorch/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 77, in test_e2e
mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, ("dp",))
AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh'
To execute this test, run the following from the base repo dir:
python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0. Did you mean: 'create_sub_mesh'?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158695
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
**Summary**
Since much of the ReplicateState functionality is copied from FSDPState, I fixed any remaining comments that incorrectly said FSDP instead of Replicate. In addition, instead of labeling modules FSDPModule or FSDPLinear, they are now labeled Replicate____. Finally, I removed some leftover code from the DDP implementation and included test cases to verify correctness.
**Test Case**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160133
Approved by: https://github.com/mori360
ghstack dependencies: #160128
Refactors how the enablement/disablement of CK Gemms and SDPA works.
- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- The getters for these library backends also do some validity checking in case the user used an environment variable to change the backend. If it is invalid (i.e. one of the cases mentioned above is false), the backend is set to the current non-CK default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers).
The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which will also be useful if you want to do for instance some local math on the data.
-----
**TODO:**
This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`.
From my investigation, I found the root cause to be that this specific tensor API uses local addresses instead of remote addresses for the peer:
```
Pointer-Based Version:
Rank 0 → Rank 1:
Local buffer: 0x430300a00 (src)
Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory
Remote signal: 0x2430301600 (sig) ← Rank 1's signal
Rank 1 (waiting):
Local signal: 0x430301600 (waits here)
Tensor-Based Version:
Rank 0 → Rank 1:
Local buffer: 0x430300a00 (src)
Local buffer: 0x430300c00 (dst) ← this is wrong
Local signal: 0x430300e00 (sig) ← this is wrong
Rank 1 (waiting):
Local signal: 0x430300e00 (waits here)
```
Next Steps: Need mechanism to resolve local tensor → remote PE address, equivalent to handle.buffer_ptrs[peer] lookup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159788
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756
This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64).
It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce
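A rough sketch of the dtype normalization described above (names and the exact mapping are illustrative, not the actual implementation; tl.dtype/constexpr handling is omitted for brevity):
```python
import torch

# Illustrative subset of the torch.dtype -> NVSHMEM typename mapping.
_NVSHMEM_TYPENAMES = {
    torch.int8: "int8", torch.int16: "int16", torch.int32: "int32",
    torch.int64: "int64", torch.uint8: "uint8",
    torch.float16: "half", torch.bfloat16: "bfloat16",
    torch.float32: "float", torch.float64: "double",
}

def _reduce_fn_name(op: str, dtype) -> str:
    # Accept torch.dtype or a plain string such as "float32".
    if isinstance(dtype, str):
        dtype = getattr(torch, dtype)
    typename = _NVSHMEM_TYPENAMES[dtype]
    return f"nvshmem_{typename}_{op}_reduce"  # e.g. nvshmem_float_sum_reduce

print(_reduce_fn_name("sum", torch.float32))   # nvshmem_float_sum_reduce
print(_reduce_fn_name("prod", torch.bfloat16)) # nvshmem_bfloat16_prod_reduce
```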
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159755
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734
Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159734
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701
This PR introduces support for Triton 3.4 and resolves several CI and test-related issues.
**Triton 3.4 Compatibility**
- The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook.
- The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes.
**Fix CI Errors**
- The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped.
- Added a decorator to run NVSHMEM tests only on H100s (compatible hardware)
**Peer Rank Calculation Fix**
- The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank.
Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159701
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215
When playing around with it, I noticed some flakiness in this test across sessions.
After debugging, it turns out the heavy sync primitives I was calling from inside Triton kernels (like `nvshmem_quiet()` or `nvshmem_fence()`) were causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock.
The fix was realizing `wait_until` already provides all the sync you need. Just do:
- PE A: `nvshmem_wait_until(&ivar, ...)`
- PE B: `nvshmem_put(&ivar_on_PE_A, ...)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159215
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136
Quick refactor for consistency and clarity.
1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms.
2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals.
To clean this up:
• All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit.
• Typed APIs consistently use nelems for element counts.
• Docstrings were added or updated to clarify expected units.
Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping.
This should make the API more intuitive and reduce friction for developers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159136
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718
For example, detect the following situation:
```
>>>Lint for test/dynamo/test_modes.py:
>>>Lint for test/dynamo/test_modes.py:
Error (TEST_DEVICE_BIAS) [device-bias]
`@requires_gpu` function should not hardcode `with torch.device('cuda')`,
suggest to use torch.device(GPU_TYPE)
687 | flex_attention as flex_attention_eager,
688 | )
689 |
>>> 690 | with torch.device("cuda"):
691 | flex_attention = torch.compile(flex_attention_eager, dynamic=False)
692 |
693 | with self.assertRaisesRegex(
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159926
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #159759
Automatically replaces split with rsplit where relevant, and only performs the split up to the first (or last) value. This lets the split function return early and improves efficiency.
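An illustration of the pattern being rewritten (the `maxsplit` argument lets split/rsplit stop early instead of scanning the whole string):
```python
path = "a/b/c/d"
head = path.split("/", 1)[0]    # first component only, stops after one split
tail = path.rsplit("/", 1)[-1]  # last component only, splits from the right
print(head, tail)               # a d
```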
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
This adds integration into inductor in two parts
1) It kicks off the best config lookup at lowering time within mm.py
2) It awaits the future at scheduling time in select_algorithm.py
Notably this does not do the following
1) Support for enumerating between mm, addmm and bmm
2) Support for enumerating between exhaustive/max
3) Enumerating different hardware SKUs eg. H100, A100, etc.
those will come in the next diffs
Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121
Approved by: https://github.com/izaitsevfb
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception
traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
launchers.append(result.make_launcher())
^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
(self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(
# wrapped exception
traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
"exceptions": [
{
"choice_type": "triton",
"choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
"exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
"exception": "OutOfMemoryError",
"required_memory": "262144",
"hardware_limit": "232448"
}
]
}
```
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688
Approved by: https://github.com/stashuk-olek
Summary:
Device mismatches in tracing can most often be ignored. These are only logical mismatches not physical.
Take any intermediate computation, and that computation will not actually materialize in a compiled binary execution. So a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on CPU device during the execution, as they are temporary allocations.
If a user knows their tensors at the graph input are all on the correct device, they can ignore all tracing errors.
Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing.
Users can set
```
torch._functorch.config.fake_tensor_prefer_device_type = 'mtia'
```
to forcefully override any mismatch and prefer the non cpu device. This unblocks vLLM graph mode for MTIA.
Test Plan:
Added two unit tests.
Rollback Plan:
Differential Revision: D79698438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931
Approved by: https://github.com/jansel
Previously we only applied this move_to_device_pass to the toplevel graph. However if we have HOO, this pass will not be applied on the HOO submodules. This PR modifies the pass to run on all submodules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992
Approved by: https://github.com/yiming0416
Update the HF components to not inherit from fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much of a need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s for a load of an 8B model), which is a significant performance win over reading bytes and converting to tensors, which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata as bytes and loading it.
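For context, a minimal sketch of the HF safetensors safe_open API referenced above (the file path is illustrative):
```python
from safetensors import safe_open

# Lazily open the checkpoint and read only the tensors we need,
# instead of reading raw bytes and converting them ourselves.
with safe_open("model-00001-of-00004.safetensors", framework="pt", device="cpu") as f:
    metadata = f.metadata()        # header metadata, no tensor reads
    for key in f.keys():
        tensor = f.get_tensor(key)  # load a single tensor on demand
```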
Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
**Summary**
Some thoughts on view-op and `_StridedShard` interaction:
1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned)
compared to `Shard`. It only changes how shards permute across the devices.
2. `view()` op on DTensor strictly forbids shard redistribution which means if
`view()` may cause shard permutation across devices, it should be rejected.
This is enforced in today's sharding prop for `view()`.
3. Since DTensor `view()` won't introduce any redistribution, it's certain that
`placements` won't change except the inner `dim` attribute of `Shard`
or `_StridedShard`.
Therefore, to support `_StridedShard` in `view()` op, the only change required
is to keep `_StridedShard` as `_StridedShard` in the output spec.
**Test**
`pytest test/distributed/tensor/test_view_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656
Approved by: https://github.com/wconstab
Summary:
Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`.
It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header here would pollute the namespace a lot and that's not what we want in general. Therefore here we just create a Handle type which hold a pointer to decouple the actual type from header definition.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79751098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989
Approved by: https://github.com/dolpm
## Fixes https://github.com/pytorch/pytorch/issues/157683
## mini repro
* Just copy the code from the issue to reproduce it.
```python
import torch
device = "cpu"
# Input tensors
v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
def my_model(v2_0, v3_0):
v6_0 = -v3_0
v4_0 = v2_0 * v3_0
v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
v0_0 = v2_0.to(torch.int32)
v5_0 = v0_0.amax(dim=0)
return v6_0, v4_0, v1_0, v0_0, v5_0
v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
compiled_model = torch.compile(my_model, backend="inductor")
v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
print("v1_0", v1_0.shape)
print("v0_0", v0_0.shape)
print("v5_0", v5_0.shape)
```
Error stack:
```
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
41 | convert(const Vectorized<src_t>& src) {
| ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
## summary
**The C++ kernel generated by the Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail.**
The original code that caused the problem.
```
def my_model(v2_0, v3_0):
v6_0 = -v3_0
v4_0 = v2_0 * v3_0
v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
v0_0 = v2_0.to(torch.int32)
    # The original line that caused the problem.
v5_0 = v0_0.amax(dim=0)
```
## proof procedure
The c++ kernel generated by inductor:
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const int32_t* in_ptr0,
int32_t* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L))
{
{
int32_t tmp_acc0_arr[16];
for (int i = 0; i < 16; i++)
{
tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min();
}
int32_t tmp_acc0 = std::numeric_limits<int32_t>::min();
at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min());
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L))
{
{
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
{
auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16));
tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
{
for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
{
auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)];
tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0);
}
}
}
}
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
{
// impossible data type conversion, which causes the g++ compilation to fail
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
{
for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
{
out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)];
}
}
}
}
}
}
```
The compiler complains:
```text
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
41 | convert(const Vectorized<src_t>& src) {
| ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
37 | auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
So the following line has a problem:
```c++
// this line means that tmp_acc0_vec should be Vectorized<int64_t>, and it will convert it to Vectorized<int32_t>.
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
The issue is that tmp_acc0_vec is of type Vectorized<int32_t>, but the template parameters expect it to be Vectorized<int64_t> and then convert it to Vectorized<int32_t>. This is a conflict: the conversion should not exist, since tmp_acc0_vec is already Vectorized<int32_t>. The following line hardcodes the output variable type to int64, which causes the unnecessary and incorrect type conversion.
d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904
Approved by: https://github.com/jgong5
Summary:
This field is not used today, and it's not useful either.
The device allocation is configured at model loading time, specified by user.
It shouldn't be part of the model definition.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79385513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653
Approved by: https://github.com/zhxchen17
This PR changes the behavior for compile wrapped op tests:
- supported_but_unclaimed_forward
- supported_but_unclaimed_backward
These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile.
~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976
Approved by: https://github.com/ezyang
If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X.
Coded with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965
Approved by: https://github.com/malfet
# Motivation
Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules.
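A minimal sketch of the legacy pattern this keeps traceable under torch.compile (it assumes an XPU build and an available XPU device):
```python
import torch

s = torch.xpu.Stream()

@torch.compile
def f(x):
    # torch.xpu.stream(...) is the older context-manager API; `with s:`
    # also works, but both forms should now be handled by the trace rules.
    with torch.xpu.stream(s):
        return x + 1

print(f(torch.ones(4, device="xpu")))
```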
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844
Approved by: https://github.com/jansel
The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there.
I recently found a new cause for this problem.
1df5d00145/Python/ceval.c (L5807-L5836)
In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur.
Here is the code to reproduce this problem.
```
import threading
import time
import torch
from threading import Event, Lock
lock = Lock()
lock.acquire()
event1 = Event()
event2 = Event()
event3 = Event()
def run():
event1.set()
event2.wait()
lock.acquire()
event3.set()
threading.Thread(target=run).start()
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
event1.wait()
event2.set()
time.sleep(1)
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
lock.release()
event3.wait()
```
<img width="1766" height="1250" alt="image" src="https://github.com/user-attachments/assets/6794eeca-7364-429e-91eb-62cdad116bd3" />
To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when a PyTrace_C_RETURN/PyTrace_C_EXCEPTION event occurs, we can determine whether to record the event based on these two fields.
In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed).
Do you think the fix is necessary?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574
Approved by: https://github.com/sraikund16
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.
Fixes: #146239
Reported-by: Adam J Stewart <ajstewart426@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
Summary:
This reverts the part of #159383 for scaled_mm where now, like before,
we pass through the normal input_nodes (not the triton_input_nodes)
to select_algorithm
- #159383 refactored how kwargs are retrieved
- it introduced this notion of KernelInputs that wrap input_nodes
- scaled_mm uses unsqueezed input nodes for triton to retrieve params
- the issue: it uses a squeezed (regular) bias for select_algorithm instead
This fixes that by passing the original input nodes rather than the triton input nodes.
Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
```
This set of tests was failing, and is passing now
Side note: these tests were failing I believe because the unsqueezed
bias made the ATEN choice no longer eligible, and there is some minor
numerical discrepancy between ATEN and Triton for this. I'm not sure
the test should be written like that, as we're implicitly relying on
ATEN being the choice here.
Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159948
Approved by: https://github.com/izaitsevfb, https://github.com/eellison
The current implementation assumes test functions are resolved as test_module.TestClass.test_fn; however, this does not work for modules nested in directories, e.g. inductor.test_torchinductor.TestClass.test_fn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637
Approved by: https://github.com/jbschlosser
This only works for the jagged layout and for the non-batch and non-jagged dimensions.
I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662
Approved by: https://github.com/jbschlosser
Before this change there were build+test jobs:
- sm89 build+tests
- sm75 build+distributed_test
- sm75 build+pr_time_benchmark test
This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159890
Approved by: https://github.com/clee2000, https://github.com/seemethere
Summary: This PR solves two issues:
1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI because it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the later corresponding wait operation will still wait on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of its output. This leaves an unwaited tensor, leading to a memory leak.
2. AOTI now generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, and the tensor returned from it doesn't get used. It is released when the program exits, so it's not a memory leak, but it unnecessarily holds onto that tensor, which causes a high memory watermark. This PR generates a tensor delete operation right after the call to aoti_torch_cpu__c10d_functional_all_reduce_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818
Approved by: https://github.com/henryhu6, https://github.com/yushangdi
Summary: This fixes a bug in the execution frame cleanup logic. Previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic.
Test Plan:
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```
Rollback Plan:
Differential Revision: D78621408
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158717
Approved by: https://github.com/dolpm
After https://github.com/pytorch/pytorch/pull/157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159957
Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre
Summary:
### PR Context
Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport; in this impl we assume world_sizes are equal, allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners".
Test Plan:
CI
Rollback Plan:
Differential Revision: D79590797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
Summary:
The following type of objects don't need to be serialized for precompile:
1. PyCapsule because we don't guard on C binding objects in meaningful ways.
2. Code objects, because we only do id matching on these, and id matches will always be dropped for precompile.
3. Nested function objects since we also ban CLOSURE_MATCH.
Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects
Rollback Plan:
Differential Revision: D78816888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
When one builds the CD docker image, all CUDA dependencies must be installed into the `/usr/local/cuda/` folder.
Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to: ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at: ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB: '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB: '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR: '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
# Background
After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which is not used at runtime.
# Motivation
This PR aims to remove the stub.c file and any other unused files when running torch_openreg.
**Changes:**
- Setting **include_package_data** keyword to false in the setup function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849
**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads
**Implementation**
- Implement a Triton kernel for the following computation (a reference sketch follows this list):
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
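For reference, a minimal eager-mode sketch of the computation described in the list above (an illustration of the math, not the Triton kernel itself):
```python
import torch

def woq_int8_mm_ref(x: torch.Tensor, w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # x: [B, K] float32, w: [N, K] int8, scale: [N] float32 -> out: [B, N] float32
    return (x @ w.to(x.dtype).t()) * scale

x = torch.randn(256, 1024)
w = torch.randint(-128, 127, (512, 1024), dtype=torch.int8)
scale = torch.rand(512)
out = woq_int8_mm_ref(x, w, scale)  # matches the unfused .mul().sum() fallback
```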
Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/
Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6
**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:
Differential Revision: D79042656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance:
```
if ...
    auto mps_lib_0_func = ...
    mps_lib_0_func->run()
// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run() // error, mps_lib_0_func not initialized
```
So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
return handle;
}
...
if ...
    get_mps_lib_0()->run()
get_mps_lib_0()->run() // success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
Also migrate `test_common_rules.py` since it was a short file
`python test/distributed/tensor/test_common_rules.py`
Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).
Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054
Waiting for scuba table to generate, but manual logging show it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.
Rollback Plan:
Reviewed By: masnesral
Differential Revision: D79308333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
Summary:
This feature is Meta internal only
Add a util function to push dynamic shape-related suggestions to MLHubDebugInsightService, which will then be surfaced to users in MLHub.
The rollout will be controlled by JK.
Test Plan:
MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21
{F1980593060}
* If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility'
* The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the dynamic shapes. The page is currently still in development, so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.
Rollback Plan:
Reviewed By: pianpwk
Differential Revision: D78526522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
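As an illustration, a rough sketch of hostname-agnostic matching for such dump names (the helper name and details are hypothetical, not the actual fr_trace change):
```python
import os
import re

def find_rank_files(trace_dir: str, marker: str = "rank_") -> dict[int, str]:
    # Accept "<hostname>.pci3_rank_<rank>" for any hostname by matching the
    # trailing "rank_<N>" instead of requiring an exact filename prefix.
    ranks = {}
    for name in os.listdir(trace_dir):
        m = re.search(rf"{re.escape(marker)}(\d+)$", name)
        if m:
            ranks[int(m.group(1))] = os.path.join(trace_dir, name)
    return ranks
```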
Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```
Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```
After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
dump = pickle.load(infile)
^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
Rollback Plan:
Differential Revision: D79224727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
Summary:
OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change updates the OrderedImporters get_name API so that it tries the get_name API of every single importer.
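A minimal sketch of the intended delegation behavior (class and method signatures are assumed for illustration, not copied from the diff):
```python
class OrderedImporters:
    """Sketch only: delegates get_name to each wrapped importer in order."""

    def __init__(self, *importers):
        self._importers = list(importers)

    def get_name(self, obj, name=None):
        last_exc = None
        for importer in self._importers:
            try:
                return importer.get_name(obj, name)
            except Exception as exc:  # try the next importer on failure
                last_exc = exc
        raise last_exc if last_exc is not None else RuntimeError("no importers")
```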
Differential Revision: D76463252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
When we set up the logging config following the guide: https://docs.pytorch.org/docs/stable/logging.html
Such as:
TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```cmd
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
On Linux, the quotes are shown by default, while Windows does not show them.
Besides that, Windows keeps the quotes as part of the value when processing the environment variable.
On Linux, we get the variable: "+schedule,+inductor,+output_code"
On Windows, we get the variable: '"+schedule,+inductor,+output_code"'
So we need to remove the outer quotes on Windows.
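A minimal sketch of the Windows-only normalization described above (the helper name is hypothetical):
```python
import sys

def _normalize_torch_logs(value: str) -> str:
    # On Windows the surrounding quotes survive into the value, e.g.
    # '"+schedule,+inductor,+output_code"'; strip them so parsing matches Linux.
    if sys.platform == "win32" and len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1]
    return value
```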
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`. There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.
Test Plan: Local inference runs.
Reviewed By: aorenste
Differential Revision: D79124705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
Originally, if the PT2 errored when loading, we would try to load it using the old loader to handle BC issues. However, this hides the error messages when an up-to-date PT2 fails to load for some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
Summary:
- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage with tests for compute and collective ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh
The only one I left is the vllm one because I'm pretty sure it's going to be used in the future
I assume everything not mentioned is old and we forgot to remove them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x plus by including `cstddef` in few headers
- Avoid using deprecated enums for MacOS builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
With FSDP, we sometimes have multiple non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See the comment inline:
```
When an operation mutates a buffer in-place, the scheduler creates a new buffer name
to track the "before" and "after" states, even though they share the same memory.
The mutated buffer represents a rename with zero allocation and deallocation cost.
During dependency tracking, we transfer dependencies from the mutated name back to
the original buffer, ensuring the original memory is only freed when all aliases
are done.
This handles cases where a buffer has multiple non-overlapping aliases - rather than
trying to assign free costs to individual aliases, we forward all alias dependencies
to the original buffer.
Consider:
buf0 = op0()
buf1 = mutation_op_(buf0)
del buf0
...
op(buf1)
del buf1
The only memory events are the creation prior to op0, and the deletion following buf1.
```
As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.
This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
This will eventually allow me to enable more OpInfo tests against the MPS device. It was supposed to be a very simple change, but it actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`
<sarcasm> I love how well documented this variable is </sarcasm>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
# Why
- Make loop iteration simpler
- Have a common spot where to make modifications that affect
all the GEMM Triton templates, avoiding missed spots
# What
- pull out common logic of taking the BaseConfig objects
and turning them into kwargs to feed into maybe_append_choice
for Triton GEMM templates
Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
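For illustration, a hedged usage sketch (this remains a private API; the settings string shown is just an example of the format the old CUDA-only helper accepted):
```python
import torch

# Equivalent to the old torch.cuda.memory._set_allocator_settings call;
# this stays an internal utility, not a public API.
torch._C._accelerator_setAllocatorSettings("max_split_size_mb:128")
```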
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.
Test Plan:
Relying on CI. Should be a NFC.
Rollback Plan:
Reviewed By: davidberard98
Differential Revision: D79378792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
It can be very slow to repeatedly hit DNS resolution failures, but
its very helpful to have DNS names in logs by default. So we try to use DNS
but if we hit a transient failure we just disable it for the remainder of the
job, logging IP addresses instead.
Fixes#159007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.
---
For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:
```
CMake Error at cmake/Dependencies.cmake:566 (if):
if given arguments:
"NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"
Unknown arguments specified
Call Stack (most recent call first):
CMakeLists.txt:868 (include)
```
Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
ROCm inductor benchmark builds are failing at the fbgemm build stage https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z 389 | x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```
It looks like asmjit fails to build; this seems to be due to fbgemm's submodules not being updated after checking out the new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507) includes:
- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
**Motivation / Context**: (what I _think_ is happening here)
In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.
But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.
**Solution space**
Export allows users to specify which dimension are dynamic, and to provide **ranges of valid sizes**.
One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.
However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.
To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`.
If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.
Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
Summary: Fix https://github.com/pytorch/pytorch/issues/159612
- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```
Rollback Plan:
Differential Revision: D79411407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
Fixes#159601
Unfortunately #156868 introduced a couple regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
Summary:
Currently this function relies on the assumption that we load `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When loading `libnvshmem.a` (the two combined together) statically, this fails. Add a section that checks whether a symbol from the host API exists at runtime to determine if nvshmem is loaded statically.
Test Plan:
CI + sample run
Rollback Plan:
Differential Revision: D79177525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
Due to the different byte order (s390x is big-endian), when copying data the value has to be put into the last bytes so that an int32_t converted to int64_t keeps the same value. The same has to be done when it's converted back.
This change fixes the test TestLibtorchAgnosticCPU::test_my_ones_like_cpu from cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.
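To make the byte-order point concrete, a small standalone illustration (not the patched code) of where the four `int32` bytes land inside an `int64` slot on little- vs big-endian machines:
```python
import struct

value = 7
# little-endian (e.g. x86): the 4 meaningful bytes sit at the start of the 8-byte slot
print(struct.pack("<q", value))  # b'\x07\x00\x00\x00\x00\x00\x00\x00'
# big-endian (e.g. s390x): they sit at the end, so a raw 4-byte copy into the
# start of the slot would change the value
print(struct.pack(">q", value))  # b'\x00\x00\x00\x00\x00\x00\x00\x07'
```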
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
This change fixes multiple tests in
test/inductor/test_aot_inductor_arrayref.py
such as
test_cond_with_parameters_cpu_with_stack_allocation,
test_issue_140766_cpu_with_stack_allocation,
test_model_modified_weights_cpu_with_stack_allocation,
test_nested_tensor_from_jagged_cpu_with_stack_allocation.
Enable tests in test/inductor/test_aot_inductor_arrayref.py
This change is split off from https://github.com/pytorch/pytorch/pull/150116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which was adding a lot of CPU overhead, especially given that Inductor was generating a number of individual tensor copy kernels for `torch.cat`.
This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that with the current heuristic for bucketing reduce_scatter, we would need to use a bucket size n_gpu times larger than the bucket for all_gather, making it GPU-count dependent and less intuitive. This PR proposes to instead use the max between the input and output sizes, so that one can use the same bucket_size value for both passes.
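A hypothetical sketch of the heuristic as described (the function name and signature are mine, not the PR's code):
```python
def bucket_bytes(input_numel: int, world_size: int, elem_size: int, is_all_gather: bool) -> int:
    # all_gather: output is world_size x larger; reduce_scatter: world_size x smaller.
    # Taking max(input, output) lets one bucket_size threshold work for both passes.
    output_numel = input_numel * world_size if is_all_gather else input_numel // world_size
    return max(input_numel, output_numel) * elem_size

# the same threshold now covers both directions of the same traffic
assert bucket_bytes(1024, 8, 2, is_all_gather=True) == bucket_bytes(8 * 1024, 8, 2, is_all_gather=False)
```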
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark leaves around before it starts - so we now GC before warming up the model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
The documented RuntimeError message for the is_nonzero(input) method is updated from "bool" to "Boolean".
**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)
**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)
**Existing Error message in documentation:**
for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous
**Proposed Error message in documentation:**
for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous
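The proposed wording matches what the runtime actually raises, which is easy to check:
```python
import torch

for t in (torch.tensor([]), torch.tensor([1, 2])):
    try:
        torch.is_nonzero(t)
    except RuntimeError as e:
        print(e)
# Boolean value of Tensor with no values is ambiguous
# Boolean value of Tensor with more than one value is ambiguous
```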
Fixes#159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
# Motivation
This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
# Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
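For reference, a hedged sketch of the quantization setup named above (this assumes a recent torchao where the config-based `quantize_` API is available; the stand-in linear layer and the `xpu` device placement are illustrative, not the benchmark script):
```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

# a stand-in bf16 layer on Intel GPU ("xpu"); the real run used Llama-3.1-8B-Instruct
layer = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="xpu")
quantize_(layer, Int8DynamicActivationInt8WeightConfig())
out = layer(torch.randn(1, 4096, dtype=torch.bfloat16, device="xpu"))  # int8 matmul path under the hood
```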
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
Summary: Turns out I added this to reduce the frequency with which we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. Oops.
Test Plan:
backout
Rollback Plan:
Differential Revision: D79474114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
Summary:
The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages.
In the FXIR backend we undo this behaviour and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We are not directly changing the original computation, as that leads to a performance degradation.
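The trick in question, for reference; it yields ceiling division only because Python floors toward negative infinity, which is exactly the language-dependent part:
```python
def ceildiv(a: int, b: int) -> int:
    # Python floors toward -inf, so -(a // -b) is ceil(a / b);
    # languages that truncate toward zero (e.g. C) would compute something else.
    return -(a // -b)

assert ceildiv(10, 3) == 4 and ceildiv(9, 3) == 3
```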
Test Plan:
CI
Rollback Plan:
Differential Revision: D79275534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.
- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
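A hedged sketch of the emission pattern described in the list above (the scheduler-side details are simplified; only the `trace_structured` artifact call is the real API):
```python
import json
from torch._logging import trace_structured

def log_collective_schedule(kernel_names: list[str]) -> None:
    # Emit the ordered list of collective kernel names as a structured
    # trace artifact that tools like TLParse can pick up per rank.
    trace_structured(
        "artifact",
        metadata_fn=lambda: {"name": "inductor_collective_schedule", "encoding": "json"},
        payload_fn=lambda: json.dumps(kernel_names),
    )
```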
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `ExpandableSegment` is defined in a cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler, so it should be checking for `_MSC_VER`, not just WIN32. The change here is similar to https://github.com/pytorch/pytorch/pull/146606.
This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140):
```
[7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj
FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. -IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl -imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp
clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument]
clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument]
In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36:
B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded
14 | inline unsigned int __builtin_clz(unsigned int x) {
| ~~~~~~~~~~~~ ^
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept'
60 | return 31 - __builtin_clz((unsigned int)__A);
| ^
1 error generated.
[7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj
```
> [!NOTE]
> I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct.
There are a few other variations on fixes to this floating around, such as:
* a97a957af0/lz4.c (L34-L43) (checking with `__has_builtin`)
* c98c55ec7e/lj92.c (L31-L46) (the same code as here, but with `_MSC_VER`)
* 2760e5a2bb/def.h (L23-L25) (using `__lzcnt` instead of a custom implementation)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273
Approved by: https://github.com/Skylion007, https://github.com/m-gallus
The previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values, leading to a weird discrepancy.
Now, the code first checks if it should be in sci_mode, then computes `max_width`.
Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])
print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```
In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```
One can see that in `sci_mode=False`, the values of A are prefixed with unneeded zeros and do not have the same `max_width` as B (A keeps the `max_width` from `sci_mode = None`).
Also in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars (this is mostly fine as the code that uses `max_width` does not rely much on it, but it is still misleading).
After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```
This also allows A to be aligned with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
This refactors the pipelining schedule tests since a lot of them have the same repeated code of:
1. Create pipelined model and reference model
2. Run reference model and pipelined model
3. compare gradients
So this refactors those parts above into helper methods and reduces ~300 LOC. Also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780
Approved by: https://github.com/wconstab
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.
This PR makes the following changes:
- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)
What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.
**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`
**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
The current executorch pin needs to be updated
The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966
The failure is that the pin uses a version of the nightly that has been removed from the nightly index
```
#62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 2.9.0.dev20250731+cpu)
#62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601
```
The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392
```
2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is'
2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is
2025-07-31T01:56:57.0249969Z + exit 1
2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1
```
I believe the cause is https://github.com/pytorch/executorch/pull/11653 where the nightly pytorch is installed from our index, but then requirements-examples installs timm from pypi, which reinstalls pytorch, except it's the release build for CUDA from pypi, which then causes triton to be installed.
I don't know what the intended behavior is, so I'm disabling the executorch docker build, the executorch build, and the nightly hash update; apparently the test was already disabled because it was failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595
Approved by: https://github.com/malfet
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). That's because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
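The arithmetic behind the overflow, spelled out (Python is used here only to show the numbers; the actual issue is the 32-bit C++ multiplication):
```python
INT32_MAX = 2**31 - 1              # 2147483647
M, N = 70_000, 65_536              # a large output shape
mb_start = M - 32                  # a block start near the end of M
print(mb_start * N)                # 4585422848, well past INT32_MAX
print(mb_start * N > INT32_MAX)    # True -> the 32-bit product would wrap around
```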
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes the typo `enought` to `enough` in 3 places in these files:
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).
Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.
Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
- Sort strategy now supports sharding on a non-sorted dim.
~~- Fix histc xfail.~~
- ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy, the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189
Approved by: https://github.com/XilunWu
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.
~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.
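For context, a small illustration of what "column-major second operand" means in terms of strides (plain CPU tensors here, not the actual grouped-GEMM inputs):
```python
import torch

k, n = 64, 128
b_row_major = torch.randn(k, n)      # strides (n, 1): row-major, what inductor may hand over
b_col_major = torch.randn(n, k).t()  # shape (k, n), strides (1, k): column-major, what the kernel expects
print(b_row_major.stride(), b_col_major.stride())  # (128, 1) (1, 64)
```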
FIXES https://github.com/pytorch/pytorch/issues/159097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
Summary:
VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels.
Currently, the graph passes are run if static_dispatch is enabled.
As we plan to enable static_dispatch by default, this diff adds an additional check so that the graph passes only apply to nodes that have all of their inputs/outputs on CPU.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79295640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519
Approved by: https://github.com/dolpm, https://github.com/henryoier
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses regex matching to find buffer names at run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls. We can expand it later if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
# Why
- Make loop iteration simpler
- Have a common spot to make modifications that affect all the GEMM Triton templates, avoiding missed spots
# What
- Pull out the common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates
Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:
In the latest nightly test, two models require baseline updates:
- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.
```
vision_maskrcnn IMPROVED: graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
vision_maskrcnn
```
```
detectron2_fcos_r_50_fpn XFAIL
detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
detectron2_fcos_r_50_fpn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
Previously, we logged `skipping cudagraphs due to [xxx reasons]` when there are cudagraph-unsafe ops. With graph partition, we split off these ops and cudagraph the remaining parts, but that log message is then skipped.
In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`
With bf16 autocast, the bias is converted to BFloat16, but fp8_qlinear_onednn_ref does not support a bf16 bias.
In this PR, the bias conversion is handled in fp8_qlinear_onednn_ref so that a bf16 bias is supported.
Added this case to the unit test; reproduce with:
`python test/test_quantization.py -k test_qlinear_fp8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.
Test Plan:
Found by inspection.
Rollback Plan:
Reviewed By: NoamPaz
Differential Revision: D79178538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
Summary:
The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.
This also means that if there are two nested passes like:
```
with GraphTransformObserver(gm, "outer"):
with GraphTransformObserver(gm, "inner"):
.....
```
We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```
Rollback Plan:
Differential Revision: D79203058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
Summary:
We found that we don't really set group_name inside group_split correctly, because we set group_name to `deviceTypeToBackend_`, which is set after `setBackend`. The same applies to group_desc. I added more unit tests for it.
We need to set the group name correctly, otherwise this will break the DeviceMesh use case when split_group is used in DeviceMesh.
Also, ncclx needs to be aware that its Option is a subclass of BackendOption.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable number of changes there. Fortunately, the only actual change in the file is the following, which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but rather a C call event loss bug on Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.
There are solutions that fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Fixes#154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79177410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
This is a follow-up to PR #154382, as the issue still persists:
```
File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
from . import api, backend_registry, functions
File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.
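For intuition, a simplified check of the "stack property" as defined above (a plain string-prefix check; the real code operates on module call paths):
```python
def has_stack_property(fqns: list[str]) -> bool:
    # every FQN in the call history must be a prefix of the next one
    return all(nxt.startswith(cur) for cur, nxt in zip(fqns, fqns[1:]))

print(has_stack_property(["outer", "outer.block", "outer.block.linear"]))  # True
print(has_stack_property(["outer.block", "other.attn"]))                   # False -> offending entries are dropped
```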
Test Plan:
added test, updated another
Rollback Plan:
Differential Revision: D79204669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.
Now, the code first checks if it should be in sci_mode, then compute `max_width`
Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])
print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```
In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```
On can see that in `sci_mode=False`, the values of A are prefixed with unneeded 0 and does not have the same `max_width` as B (It keeps the `max_width` from `sci_mode = None`)
Also in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars... (But it is fine as the code that uses `max_width` do not rely much on it, but still, this is missleading)
After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```
This also allows to align A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```
This PR selects option (1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pairs.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]` (a small parsing sketch follows this list)
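A minimal Python sketch (not the C++ `ConfigTokenizer`) just to make the documented `key:value` / `[...]` convention concrete:
```python
import re

def parse_alloc_conf(conf: str) -> dict:
    # keys are word-like; values are either a bare token or a []-delimited list
    out = {}
    for m in re.finditer(r"(\w+):(\[[^\]]*\]|[^,]+)", conf):
        key, val = m.group(1), m.group(2).strip()
        out[key] = [v.strip() for v in val[1:-1].split(",")] if val.startswith("[") else val
    return out

print(parse_alloc_conf("key1:123, key2:[val1,val2]"))
# {'key1': '123', 'key2': ['val1', 'val2']}
```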
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Summary: The AI system co-design team requested adding a user annotation for the FX graph cache key in the PyTorch Kineto trace and Execution trace. With this annotation, they can tell which FX graph the kernels belong to.
Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Rollback Plan:
Differential Revision: D79019069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
Sphinx likes titles and complains when they are not there. So adding a title to address this warning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.
It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
Switch from guard_size_oblivious to guard_or_false when we encounter a DDE; this avoids folding this 3D bmm into a mm.
806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)
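The behavioral difference, sketched (simplified; this assumes `guard_or_false` is importable from `torch.fx.experimental.symbolic_shapes`, as referenced above):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def should_fold_on_empty(t):
    # guard_size_oblivious(t.numel() == 0) raises GuardOnDataDependentSymNode
    # when the expression is data-dependent (see the tracebacks below);
    # guard_or_false answers False in that case, so the fold is simply skipped.
    return guard_or_false(t.numel() == 0)
```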
## DDE
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4472 in should_fold)
```
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
This only handles AttributeError, but in general, any exception coming from here is a user exception. Let me know if we prefer to catch all exceptions and then reraise them as observed exceptions.
```
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
initialize_lazy_module(tx, mod, args, kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
mod._infer_parameters(mod, fake_args, fake_kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
...,
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```
Note that we crash with a slightly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
exp_res = opt_m(x, y)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
return super().__call__(*args, **kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
self.foo += 1
^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
f"'{type(self).__name__}' object has no attribute '{name}'"
)
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `AllocParams` is defined in a cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
The MI355 CI regression and hiprtc kernel compilation are failing due to duplicate definitions of traits, leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Checking whether using the HIP version instead of the ROCm version for the check would help here, as the ROCm version and HIP version aren't synced. The ROCm 7.0 Alpha build used in CI is still on HIP 6.5.
Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292
Also, this PR increases the frequency of this MI355 CI to twice a day so we can catch and identify regressions more easily if they happen, for now.
Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary.
Test Plan:
It's a targets change so CI should suffice
Rollback Plan:
Differential Revision: D78910653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
# Note - On Lambda guarding of object aliasing
We previously installed object-aliasing guards as relational guards, but that undermined the recursive-dict guard optimization: placing the aliasing guard at a leaf prevented the parent dict node from qualifying as a recursive-dict guard root. Because aliasing guards are rare, we now emit them as epilogue guards via a small Python lambda. This repeats the access in Python, adding a bit of work, but the overhead is outweighed by the gains from enabling the recursive-dict guard optimization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
Summary:
A fallback kernel's output may be a non-list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such case.
Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.
Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
line.codegen_fx(self)(line)
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
(Full error paste in P1878839403)
With this diff the error is no longer present.
Rollback Plan:
Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Fixes#157452
Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```
### Release Notes
Change to nn.Parameter Constructor Behavior in Dynamo
A semantic change is introduced in the nn.Parameter constructor; previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are now suggested to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. Users can escape-hatch to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False` if this cannot be done.
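The escape hatch, as a one-liner (flag name taken from the release note above; the module path is assumed to be `torch._dynamo`):
```python
import torch._dynamo

# restore the previous behavior of synthesizing the nn.Parameter in-graph
torch._dynamo.config.graph_break_on_nn_param_ctor = False
```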
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
Fixes#154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
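A hedged repro sketch of the scenario described above (the shapes are illustrative assumptions, and plain tensors are used instead of DTensor):
```python
import torch

# An (M x 1) @ (1 x N) matmul compiled with dynamic shapes; before the fix the
# decomposition could evaluate the symbolic sizes, so the output was not
# dynamically shaped.
@torch.compile(dynamic=True, fullgraph=True)
def outer_product(a, b):
    return torch.mm(a, b)

a = torch.randn(5, 1)
b = torch.randn(1, 7)
print(outer_product(a, b).shape)  # torch.Size([5, 7])
```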
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary:
Remove use of targetDevice in KernelFactory.
AOTI would infer device when creating AOTIDelegateExecutor.
Test Plan:
CI
Rollback Plan:
Reviewed By: dolpm
Differential Revision: D79007317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
This adds an option for backend precompile artifacts to be *editable*, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them.
This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(), it takes the new autotuning results at runtime and saves only the necessary triton kernels.
This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
Summary: This test was using do_bench, so it was flaky because performance is non-deterministic.
Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled
Rollback Plan:
Differential Revision: D79098692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
Summary:
Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU.
Previously, we only checked that the output tensor is on CPU. This misses the case where we do a DeviceToHost aten._to_copy.
This prepares for turning on static dispatch kernels by default.
Test Plan:
I should add some tests before landing.
Rollback Plan:
Differential Revision: D78747600
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). That's because the kernel computes the output buffer address with
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
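A small illustration of the overflow described above (the sizes are assumed examples, not the shapes from the bug report):
```python
import ctypes

# With 32-bit ints the offset mb_start * N wraps around, while 64-bit
# arithmetic yields the intended offset.
mb_start, N = 70_000, 40_000
as_int32 = ctypes.c_int32(mb_start * N).value  # wrapped, negative offset
as_int64 = mb_start * N                        # what int64_t arithmetic yields
print(as_int32, as_int64)
```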
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes#158892
All custom operators should go through the graph.call_function path. The
other fallback path is for aten/prim operations that don't have support
for things (like torch.float8_e8m0fn).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
## Summary
This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`.
## Context
The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows.
## Why This Change?
After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations.
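For example, the target use case looks like the following sketch (the shapes and the specific foreach op are illustrative assumptions):
```python
import torch

xs = [torch.randn(n) for n in (8, 16, 32)]
ys = [torch.randn(n) for n in (8, 16, 32)]

@torch.compile(dynamic=True)
def foreach_add(xs, ys):
    # A foreach op compiled with dynamic shapes, now handled by default
    # without manually flipping combo_kernel_foreach_dynamic_shapes.
    return torch._foreach_add(xs, ys)

out = foreach_add(xs, ys)
print([t.shape for t in out])
```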
### Current behavior:
- Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes`
- Without this flag, foreach operations may fail with dynamic shapes
- This creates friction and confusion
### With this change:
- Foreach operations work seamlessly with dynamic compilation
- No manual configuration needed
- Better "it just works" experience
## Testing
Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1:
- ✅ Various tensor sizes (8, 16, 32, 64, 128)
- ✅ Multiple tensors in operations (tested up to 20)
- ✅ Nested foreach operations
- ✅ Mixed operations (foreach + standard operations)
- ✅ Both CPU and CUDA devices
- ✅ Symbolic shapes with dynamic compilation
## Impact Assessment
- **Performance**: No impact - this only affects compilation behavior
- **Backward Compatibility**: Fully maintained - users can still set to `False`
- **Risk**: Minimal - feature has been stable since August 2024
## References
- Original implementation: PR #134477 by @qchip
- This completes the feature rollout by making it available by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985
Approved by: https://github.com/jansel, https://github.com/mlazos
Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792)
Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing and I tracked it back to FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666), which started breaking our ZB tests.
This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together or if the grad value is None, then ignores it). Here is an example where the backward would have two grad values (gI1, gI2):
```python
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x, 2

    @staticmethod
    def backward(ctx, gI1, gI2):
        assert gI2 is None
        return gI1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084
Approved by: https://github.com/tianyu-l
If `return_debug_mask` is False (which is the default value for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means that the shardings passed for the batch and CP case can yield invalid dimensions.
This PR fixes it for `scaled_dot_product_flash_attention_strategy`. Note that `scaled_dot_product_cudnn_attention_strategy` doesn't have this issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205
Approved by: https://github.com/wconstab
Switch from `guard_size_oblivious` to `guard_or_false`: if we encounter a DDE, this falls back to computing elementwise strides.
2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)
We think it's safe because Laith tested whether this fallback would fail any tests. It did not.
https://github.com/pytorch/pytorch/pull/158157
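A minimal sketch of the pattern described above (assuming the `guard_or_false` helper exported from `torch.fx.experimental.symbolic_shapes`):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

# guard_or_false returns False instead of raising a data-dependent error when
# the expression cannot be decided, so we take the slower fallback path.
def pick_path(length):
    if guard_or_false(length == 1):
        return "fast path"
    return "fallback: compute elementwise strides"

print(pick_path(1), pick_path(5))
```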
## Data-dependent exceptions (DDE)
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy
x_tensor = torch._prims.convert_element_type(x_tensor, dtype)
...
File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta
if torch._prims_common.is_non_overlapping_and_dense(a):
File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense
if guard_size_oblivious(length == 1):
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)). (Size-like symbols: u0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894
Approved by: https://github.com/pianpwk, https://github.com/laithsakka
Fixes mm on B200:
Before:
```Shell
def _addmm_nvfp4_dispatch(
    a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """
    Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
    The only difference is whether bias is None or not.
    """
    assert a._data.is_contiguous()
    assert b._data.t().is_contiguous()
    assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
    assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"
    M, K = a.shape[0], a.shape[1]
    N = b.shape[1]
    # Swizzle Dizzle
    if a._is_swizzled_scales:
        a_scale_blocked = a._scale_e4m3  # Already swizzled
    else:
        a_scale = a._scale_e4m3.view(M, K // a._block_size)
        a_scale_blocked = to_blocked(a_scale)
    if b._is_swizzled_scales:
        b_scale_blocked = b._scale_e4m3  # Already swizzled
    else:
        b_scale = b._scale_e4m3.view(N, K // b._block_size)
        b_scale_blocked = to_blocked(b_scale)
    # Merge double quant scales into 1 scale for Scale_In^D
    if a._per_tensor_scale is not None:
        assert b._per_tensor_scale is not None
        scale_result = a._per_tensor_scale * b._per_tensor_scale
    else:
        assert b._per_tensor_scale is None and a._per_tensor_scale is None
        scale_result = None
    # THIS IS A WORKAROUND:
    # RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
    # When we have per-tensor scaling, we need to apply it before bias
    # since bias is not quantized
    should_add_bias_separately = (scale_result is not None) and (bias is not None)
    # should_add_bias_separately = bias is not None
>   result = torch._scaled_mm(
        a._data.view(torch.float4_e2m1fn_x2),
        b._data.view(torch.float4_e2m1fn_x2),
        a_scale_blocked.view(torch.float8_e4m3fn),
        b_scale_blocked.view(torch.float8_e4m3fn),
        bias=None if should_add_bias_separately else bias,
        out_dtype=a._orig_dtype,
        # scale_result=scale_result,  # Not supported yet
    )
E   RuntimeError: Invalid scaling configuration.
E   - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E   - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E   - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E   - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E   - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E   - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E   Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
See docblock for details. The API here has been validated by use
in autoparallel but I'm always open to suggestions for tweaks. One
particular choice I made is to make most of the functions return dicts
by default; this isn't strictly necessary for inputs but it is very
convenient for outputs as the output desc lives on the output node,
not the argument that feeds into the node.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005
Approved by: https://github.com/wconstab
Rewriting the bucketing of all_gather and reduce_scatter by defining the "merge graph" via a torch function:
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`
(Instead of creating nodes and doing FakeTensor prop manually)
This makes it easier to experiment with the merge function.
Used foreach_copy_ in the merging function for all_gather and added an inductor lowering for `foreach_copy_`.
Adding topological sort after bucketing passes (comment in post_grad.py):
```
# Fx collectives bucketing passes require topological sort for the cases:
# when bucketed collectives have users before the last collective in the bucket
# AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
#
# In this case we can not manually pick the place for bucketed collective insertion.
# But we are guaranteed by the bucketing (independent collectives in the bucket),
# that it is possible to reorder nodes to satisfy all ordering requirements.
#
# --- before bucketing ---
# in0 = ...
# wait_ag0 = ag(in0)
# user0(wait_ag0)
# ...
# pre_in1 = ...
# in1 = transform(pre_in1)
# wait_ag1 = ag(in1)
# user1(wait_ag1)
#
# --- after bucketing ---
#
# in0 = ...
# user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
#
# pre_in1 = ...
# in1 = transform(pre_in1)
# ag_bucket(in0+in1)
# wait_bucket
# wait_ag0 = wait_bucket[0]
# wait_ag1 = wait_bucket[1]
# user1(wait_ag1)
```
Correctness of the passes is verified by the loss curves for llama3 8b for simple_fsdp and for autoparallel:
<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.
#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear
This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
Adds `c_shim_aten.{h/cpp}` and uses this for `fill_`
This is the generated `c_shim_aten.cpp` for reference
```cpp
// WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND.
// See 7e86a7c015/torchgen/gen.py (L2424-L2436) for details
// This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py
#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/CompositeExplicitAutogradFunctions.h>
#include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h>
#include <ATen/CompositeImplicitAutogradFunctions.h>
#else
#include <ATen/ops/fill.h>
#endif // AT_PER_OPERATOR_HEADERS
using namespace torch::aot_inductor;
AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) {
  AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
    at::fill_(
        *tensor_handle_to_tensor_pointer(self), value
    );
  });
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974
Approved by: https://github.com/albanD, https://github.com/janeyx99
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a C-call event loss bug on Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.
There are ways to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Summary:
Placement is leaked to too many classes!
In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement.
After applying placement, the in-memory graph, tensorMeta, weightMeta would already have the re-mapped device.
The subsequent weight loading, sample input loading, and target device inference would look up the re-mapped device from the graph's tensorMeta.
graph's tensorMeta becomes the only ground truth!
Test Plan:
Need to add some tests before landing.
This is a big change.
Rollback Plan:
Differential Revision: D78841818
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996
Approved by: https://github.com/henryoier
Fix docstring for clip_grads_with_norm_ to reflect clamping behavior
This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by:
grad = grad * (max_norm / (total_norm + eps))
However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users.
Updated the formula in the docstring to:
grad = grad * min(max_norm / (total_norm + eps), 1.0)
Added a note explaining the rationale for clamping (to prevent gradient amplification).
Ensured consistency with the behavior of clip_grad_norm_.
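A minimal sketch of the clamping behavior described above, assuming the `torch.nn.utils.get_total_norm` / `clip_grads_with_norm_` APIs and their keyword names:
```python
import torch
from torch.nn.utils import clip_grads_with_norm_, get_total_norm

p = torch.nn.Parameter(torch.ones(3))
p.grad = torch.tensor([3.0, 4.0, 0.0])   # ||grad|| = 5
total_norm = get_total_norm([p.grad])

clip_grads_with_norm_([p], max_norm=10.0, total_norm=total_norm)
print(p.grad)  # unchanged: scale = min(10 / 5, 1.0) = 1.0, grads are never scaled up

clip_grads_with_norm_([p], max_norm=1.0, total_norm=total_norm)
print(p.grad)  # scaled down by roughly 1/5
```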
Fixes#151554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200
Approved by: https://github.com/mikaylagawarecki
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.
This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
* This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
Wrapping is load bearing for things that introspect argument signatures,
but use of functools.wraps to do this is undesirable as this overrides
the name/module of the wrapping function, which is bad for tracking down
exactly what code is actually being run at runtime. simple_wraps is
like wraps but it doesn't override the name information, so you still
get an appropriate printout. To see the stack of all functions wrapping
each other, there is now a helper fn_stack.
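A hypothetical sketch of the idea (not the actual torch implementation): copy introspection metadata from the wrapped function but deliberately leave `__name__`/`__module__`/`__qualname__` alone, and use the recorded `__wrapped__` links to walk the wrapper chain.
```python
import functools

def simple_wraps_sketch(wrapped):
    def decorator(wrapper):
        # Copy the docstring and merge __dict__, record __wrapped__, but keep
        # the wrapper's own name/module so runtime printouts stay accurate.
        functools.update_wrapper(
            wrapper, wrapped, assigned=("__doc__",), updated=("__dict__",)
        )
        return wrapper
    return decorator

def fn_stack_sketch(fn):
    # Walk __wrapped__ links to list every function wrapping the original.
    stack = [fn]
    while hasattr(stack[-1], "__wrapped__"):
        stack.append(stack[-1].__wrapped__)
    return stack
```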
I also make some assertions tighter in the descriptor PR. These didn't
catch any bugs but I figure might as well.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734
Approved by: https://github.com/wconstab
ghstack dependencies: #158624, #158708
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2
sm90 is H100; I'm not sure which unit tests we have running there, but I assume those machines also have a lot of memory.
They use larger runners, which have more GPU memory, so it's usually OK. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)
I've applied skips to the ones that OOMed
Time decreases from ~2.7hr per test job -> ~2hr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
----
- First, we add a new expanded_def to FX, which will expand the
definitions of variables into multiple lines, one per variable
definition. This makes extremely long args/return lists much
more readable.
- Next, we extend this mechanism to also print out descriptors on
placeholders and return values, as comments, if available. This
is how we will test descriptors.
- We update tlparse for AOTAutograd to use this format.
- We update expect tests to use this format and update their formats,
so you can inspect what it can look at. There may be other tests
I should update, open to suggestions.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708
Approved by: https://github.com/wconstab
ghstack dependencies: #158624
One of the recurring challenges of working with FX graphs produced by
AOTAutograd is that there is a very intricate input/output calling
convention that is essentially impossible to understand without actually
reverse engineering the AOTAutograd code. It is so bad that there
is a bit of logic for stashing indices of relevant arguments/outputs
in TracingContext so Inductor can figure out what the correct arguments
are.
This PR introduces the necessary scaffolding to keep track of
"descriptors" of every input/output to a (joint) FX graph produced
by AOTAutograd. First read through descriptors.py to get a sense for
what is available: for inputs, you can figure out if you have
a plain input, tangent, parameter, or something more exotic like
one of the fields of a subclass or view base. For outputs, you can
determine if you have a plain output or grad, or something more exotic
like the contents of a mutated input or an intermediate base of several
views that were returned.
There are two distinct parts of this patch: AOTInput tracking, and
AOTOutput tracking.
**AOTInput tracking.** The way this works is that AOTAutograd starts off
with some Tensor `flat_args` that are the inputs to the graph being
traced, and then updates these arguments as it modifies the input
calling convention. Anywhere these `args` are passed around, we now add a
new argument `args_descs` which is updated in synchrony with args. Add
a new arg? Add a new AOTInput to `args_descs`.
**AOTOutput tracking.** Originally, I wanted to also add an `outs_descs`
analogous to `args_descs` tracking output metadata. However, it is
often difficult to compute what the output will be until you're actually
tracing the function for real (and are able to peek at the real
outputs). So we only compute `outs_desc` when we actually trace. To do
this, we change the calling convention of the function we trace to
return not just outputs, but a tuple of `outs` and `outs_descs`. Before
we bottom out at the `make_fx` invocation, we save `outs_descs` to a
nonlocal and bottom out.
To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624
Approved by: https://github.com/xmfan
## Fixes https://github.com/pytorch/pytorch/issues/157959
## mini repro from issue
```python
import torch
from torch import nn

class Foo(nn.Module):
    def __init__(
        self,
        use_parameter: bool
    ) -> None:
        super().__init__()
        self.b = 101
        if use_parameter:
            self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # return x + self.b
        # return x - self.b
        return x / self.b
        # return x * self.b

torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)
models = [
    Foo(False),
    Foo(True),
    torch.compile(Foo(False), fullgraph=True),
    torch.compile(Foo(True), fullgraph=True),
]
for m in models:
    print((m(x) - expected).sum())
```
all outputs equal zero except the result of torch.compile(Foo(False), fullgraph=True)
## summary:
When the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal.
This can lead to precision loss in the C++ kernel, but not in the Triton kernel.
## why:
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const float* in_ptr0,
float* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
}
}
}
}
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>
int main() {
    auto tmp1 = static_cast<float>(0.009900990099009901);
    auto tmp2 = static_cast<double>(0.009900990099009901);
    std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
    std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
    return 0;
}
```
the output is
```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
`auto tmp1 = static_cast<float>(0.009900990099009901);` This will cause tmp1 to become 0.0099009, resulting in a loss of precision, so the final result will not match the expected value.
I also found the position where the bug occurs:
86d8af6a6c/torch/_inductor/lowering.py (L6238)
The commit states that the precision loss is expected in the CUDA implementation.
Original commit:
03439d4c1c
CUDA implementation:
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)
Interestingly, the Triton kernel works correctly because Python float literals have double precision.
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 0.009900990099009901
    tmp2 = tmp0 * tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
The Python dispatcher is not always enabled for fake tensors and has to be enabled explicitly.
While it should be enabled by default, that requires some work to get all tests working.
I have run into several issues (e.g. XLA, Helom, etc.) where I had to add enable_python_dispatcher
to avoid problems related to that; for the view op specifically, I moved it into the fake tensor impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
The original code comments:
```python
# Let's not fail if we can't clean up the temp dir. Also note that for
# Windows, we can't delete the loaded modules because the module binaries
# are open.
```
But the `ignore_errors` parameter is missing for the Windows path; this PR adds it.
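A minimal sketch of the cleanup pattern described above (assuming the cleanup uses `shutil.rmtree`; the path is illustrative):
```python
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
# On Windows the loaded module binaries may still be open, so cleanup must
# not raise if deletion fails.
shutil.rmtree(tmpdir, ignore_errors=True)
```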
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025
Approved by: https://github.com/jansel
The MIOpen integration has changed over the years. In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode. But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True. This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode. Thus, immediate mode needs its own flag. Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags. This PR adds the first miopen-only flag to control its immediate mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: The subclass can override the filtering logic to customize which frames to keep or drop.
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_custom_obj_list_out
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r class_member_back_compat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266
Approved by: https://github.com/ezyang, https://github.com/yushangdi
Fixes#158120
The issue was caused by populating a builtin tensor fn map at import time; if torch.export.export was called with the `meta` device before any dynamo imports, this map would not be populated, so it would get populated at import time, which would try to call `torch.disable` before it was initialized.
The fix is to populate this map lazily.
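A hypothetical sketch of the lazy-population pattern (the names here are illustrative, not the actual dynamo internals): build the map on first use, after imports have settled, instead of at import time.
```python
_BUILTIN_TENSOR_FNS = None

def _build_builtin_tensor_fn_map():
    # Stand-in for the real map construction, which touches torch internals.
    return {"add": "Tensor.add"}

def get_builtin_tensor_fns():
    global _BUILTIN_TENSOR_FNS
    if _BUILTIN_TENSOR_FNS is None:
        _BUILTIN_TENSOR_FNS = _build_builtin_tensor_fn_map()
    return _BUILTIN_TENSOR_FNS
```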
```
python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s
Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.
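A minimal sketch of opting back into the old behavior via the environment variable named above:
```python
import os

# Force a GC on every cudagraph capture again (must be set before capture).
os.environ["TORCH_CUDAGRAPH_GC"] = "1"
```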
We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to not collect more
than every 10s so if we're capturing in a loop we don't burn all our cycles
looking for garbage.
(These numbers have a lot of variance from run to run but give the correct
general scale.)
```
| calls | total | synchronize | gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before | 5427 | 78s | 1.48s | 5427 | 53.22s | 1.21s | 145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after | 5427 | 24s | 0s | 3 | 1.53s | 0.84s | 592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
calls - number of times torch.cuda.graphs.graph.__enter__ was called
synchronize - this is the duration taken by the cuda.synchronize call
gcs - number of times gc.collect was called
collect - this is the duration taken by the gc.collect call
empty cache - this is the duration taken by the torch.cuda.empty_cache call
sys freed - the number of bytes reported freed by gc.collect
cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved
So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.
Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
Before:
```
.Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context:
```
After:
```
Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924
Approved by: https://github.com/williamwen42, https://github.com/zou3519
Previously precompile was implemented under the assumption that dynamo always inlines the user code and generates resume functions when a graph break is hit. In cases like nanogpt training, there exists a nontrivial amount of code causing dynamo to fail speculation and stop inlining certain types of user functions. This results in more code objects to be tracked by CompilePackage.
Since these new code objects are user defined, we also need to serialize the location of this code so that we can load the precompile entries onto these code objects in another process.
With this fix, we are able to run nanogpt inference+training with precompile under torchbench.
Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947
Approved by: https://github.com/jamesjwu
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
When select has a data-dependent input, we can't tell if the actual index shall be index+size or index.
To avoid throwing a DDE, we allocate a new unbacked symbol to represent the storage offset of the
output view, and we compute its value dynamically at runtime during inductor lowering.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
Add `sort` and `scatter_add` strategies. I am reusing the strategy for `scatter`-related ops for quick support. The strategy can potentially be improved after we fix index-related strategies.
Minor fix: fix `replicate_op_strategy` to support outputting multiple tensors, which is required by aten.sort.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022
Approved by: https://github.com/XilunWu, https://github.com/wconstab
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.
Test Plan:
N/A
Rollback Plan:
Differential Revision: D78500790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo
This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`
Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```
|  | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta | +990 | +9 | +44.43% | +155 | 0 | +42.82% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Putting both the dispatch API and the combine API through a battlefield test, one following the other, i.e.
```
all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...)
all_to_all_vdev_2d_offset(
input=out,
out=combine_out,
in_splits_offsets=out_splits_offsets,
out_splits_offsets=combine_out_splits_offsets
)
```
Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine.
Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits.
It works!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026
Approved by: https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881
Added `all_to_all_vdev_2d_offset`, which:
Perform a 2D AllToAllv operation, with input split and offset
information provided on device. The input offsets need not be the
exact prefix sum of the input splits, i.e. paddings are allowed between the
split chunks. The paddings, however, will not be transferred to peer
ranks.
In Mixture of Experts models, this operation can be used to combine tokens
processed by experts on remote ranks. This operation can be viewed as a
"reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles
tokens to experts).
The change may seem a bit dense, sorry. But it is mainly two changes:
1. templating existing device functions (to use provided input offset or calculate it)
2. generalizing variable names, e.g. npes, ne --> minor_size, major_size,
so that I can use the same alltoall function for matrix of (nranks, ne) as well as matrix of (ne, nranks).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.
The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.
## Test Plan
After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
- Prevent the inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from pytorch but is still a valid argsort output.
- Add functionality to allow alternative assert_equal functions in inductor tests for future cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622
Approved by: https://github.com/eellison
Co-authored-by: George Wigley <georgewi@graphcore.ai>
Summary:
In general, device_ is not very useful in OpKernel. Remove it to avoid misuse.
Also, the meaning of `device_` is ambiguous in OpKernel:
For StaticDispatch kernels, we always call the CPU kernel.
For C10Kernel, we rely on the input tensor's device and the dispatcher to determine which device to run on.
For ops involving multiple devices, e.g. aten._to_copy(device), the meaning of device is ill-defined.
Test Plan:
CI
Rollback Plan:
Reviewed By: henryoier, dolpm, kqfu, zhxchen17
Differential Revision: D78704840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944
Approved by: https://github.com/dolpm
This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models take a long time, which leads to the benchmark runs timing out. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)
• The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels for aarch64 platforms, which significantly slow down execution, leading to the timeout.
**Action:**
• An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.
To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
* timm benchmarks
* torchbench benchmarks
As suggested, the skip is applied at the CPU arch level, explicitly branching for aarch64 and adding the models which need to be skipped. This keeps the logic clean, but:
• An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.
Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.
- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction
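A minimal usage sketch of the API described above (default mode, i.e. the xor reduction; any additional keyword arguments are not shown):
```python
import torch

t = torch.randn(8, 8)
h = torch.hash_tensor(t)
print(h, h.dtype)  # a uint64 hash of the tensor contents
```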
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework the aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
#### 1. Provide a default fallback strategy that can apply to arbitrary operator with output in type of single tensor.
We can call register_op_strategy to register using the `fallback_op_strategy`:
- For ops without List[Tensor] as input, call:
```
register_op_strategy(op_overload)(replicate_op_strategy)
```
- For ops containing List[Tensor] as input, call:
```
register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy)
```
The strategy will force all input and output to be replicated with the corresponding redistribute_cost.
#### 2. Add a test function as a necessary condition for strategy function.
```
detect_exists_identical_opspec(*args, op, mesh, strategy_function)
```
This function detects if identical strategies will be produced given the sample `args`. It will iterate all combinations of placements for each arg and produce the output strategy from the registered `strategy_function`.
#### 3. Provide a context manger `op_strategy_context` to easily register/unregister strategies for testing.
E.g.,
```
with op_strategy_context(test_op.default, replicate_op_strategy):
...
```
#### 4. Fix a bug where TupleStrategy never gets flattened as expected:
9df0176408/torch/distributed/tensor/_op_schema.py (L286)
Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after `strategy_schema = _wrap_with_op_strategy(op_schema)`.
This is the first implementation. Plan to add support to enable sharding on the batch dim as the output strategy next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
We don't create new PGs when doing slicing in DeviceMesh, so it is relatively safe to relax the requirement that one can only do slicing from the root mesh. But this does come with a caveat when usage is asymmetric, for example when only some ranks have the sliced-out submesh. So aside from removing the requirement we also add a warning here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR has some mypy errors, but I believe those were in the code base beforehand and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a C-call event loss bug on Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.
There are ways to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape.
One can say these hints are "wrong", but in various cases in the hints being wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.
For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
Add test that require weights to be packaged for torch native
For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.
After we added weight deduping, we should be able to let this config be False.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
Fix up decomposeK autotuning by removing the condition to return more than `k_splits_limit` and setting the default to 10 instead of 5. Allow `k_splits_limit` to be configurable by the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and also allow the user to configure the threshold at which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`
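A sketch of the knobs named above; the values are illustrative assumptions, not recommended settings:
```python
import os

os.environ["TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS"] = "10"  # new default per the description
os.environ["TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD"] = "32"   # assumed example threshold
```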
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`.
In particular, the runner will no longer start the test (forking a subprocess) only to immediately stop it (killing the subprocess) during test setup.
Using `unittest.skip` decorators avoids starting the test in the first place.
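A small sketch of the preferred pattern: declare skips with `unittest` decorators so a skipped test is never started (and never forks worker subprocesses).
```python
import sys
import unittest

class ExampleDistributedTest(unittest.TestCase):
    @unittest.skipIf(sys.platform == "win32", "not supported on Windows")
    def test_collective(self):
        # The decorator is evaluated before setUp, so no subprocess is
        # spawned for a skipped test.
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()
```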
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
Summary:
# Why
capture relevant data for offline lookup table generation
# What
Report the hinted sizes, not just the symbolic sizes.
Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```
This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)
Rollback Plan:
Reviewed By: stashuk-olek
Differential Revision: D77837548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852
Approved by: https://github.com/jingsh
# Description
Add a base docker image for vllm.
We use the base docker image for both the pytorch build and tests, so configure a base image for vllm against pytorch CI.
# Others
Added a README describing how the base docker images are used and how to add one; it also explains which file is the right one to modify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158755
Approved by: https://github.com/seemethere, https://github.com/huydhn
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise it raises a data-dependent error (DDE) and we can't take the safe/slower path. A short sketch of this pattern follows the examples below.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbacked symbols would be ≥ 2, which falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).
### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/ir.py:2797 in create)
```
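A minimal sketch of the pattern above, using the existing symbolic-shapes helper; `fast_path` and `general_path` are hypothetical stand-ins for an optimized and a general lowering.
```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

def fast_path(x):     # hypothetical optimized lowering
    return x

def general_path(x):  # hypothetical safe/slower lowering
    return x

def lower_op(x):
    # statically_known_true never guards: it returns False when the answer is
    # unknown (e.g. for an unbacked u0), so we fall back to the general path
    # instead of raising a data-dependent error.
    if statically_known_true(x.shape[0] == 1):
        return fast_path(x)
    return general_path(x)

print(lower_op(torch.ones(1, 3)))
```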
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices of the loads depend only on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict, given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape.
One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.
For nice shapes with K=1, where M and N are a multiple of 8 so these hints are fine and there is no misaligned address, there is no performance regression observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
Adds guilhermeleobas to merge_rules for Dynamo and functorch.
Guilherme has done good work on both of these subsystems and I am tired
of him approving my PRs and me not being able to merge them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158620
Approved by: https://github.com/anijain2305
Differential Revision: D78431075
For #158366
- Calls runtime asserts pass for HOP subgraphs (in reenter_make_fx)
- For while_loop only (can be expanded), clones input tensors for subgraph tracing, so unbacked memos (item, nonzero, etc.) aren't reused
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158467
Approved by: https://github.com/ydwu4
# Background
`C10_WARP_SIZE`, although always `32` on the CUDA platform, varies across different AMD GPUs.
Therefore, to refer to this value correctly, host code must treat it as a variable rather than a macro-defined literal or a `constexpr int`.
This PR may cause more compiler errors for third-party code on AMD GPUs, which is intentional. Having a fixed `C10_WARP_SIZE` value in host code for AMD GPUs only defers a compile-time error to runtime.
This PR is recommended to be included in the Release Notes to describe an API change for whoever uses this macro.
Users are recommended to use `C10_WARP_SIZE` directly, which adapts to the various scenarios, or to define their own macro in terms of `C10_WARP_SIZE`. Assigning this macro to symbols shared by host/device code causes problems on the ROCm platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example.)
# Behaviors
* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
+ Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
+ Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
- `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
+ Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`
# `constexpr` variant for host code
For host-compilation cases where a `constexpr` value is needed for the warp size (e.g. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre-04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Fixes#156012
This is a temporary solution that makes context parallelism work before the logsumexp behavior changes land in AOTriton.
After discussion, we are not going to release AOTriton 0.10.1 to fix this because:
* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes https://github.com/pytorch/pytorch/issues/158164
This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.
Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:
BEFORE:
```
requires_grad=False
eager 7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)
requires_grad=True
eager 8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)
inference_mode()
eager 6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```
AFTER:
```
requires_grad=False
eager 7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)
requires_grad=True
eager 9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)
inference_mode()
eager 6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```
Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)
Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
https://github.com/pytorch/pytorch/pull/154193 was reverted due to a test failure. The root cause: an executorch pass turns int inputs into a scalar tensor in cond's subgraph. The pass has been on the critical path of executorch for two years, and changing it would be difficult. So we just allow non-fake inputs when checking input mutation and aliasing, which shouldn't affect the correctness of the analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158798
Approved by: https://github.com/pianpwk
Fix PyTorch tensor copying warning in ONNX export
## Problem
PyTorch ONNX exporter was generating a warning about incorrect tensor copying method:
```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```
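A minimal sketch (not the exporter's internal code) of the pattern change that silences this warning: avoid re-wrapping an existing tensor with `torch.tensor()`.
```python
import torch

t = torch.arange(4.0)
copy_old = torch.tensor(t)     # triggers the UserWarning above
copy_new = t.clone().detach()  # recommended way to copy-construct a tensor
print(torch.equal(copy_old, copy_new))
```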
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158835
Approved by: https://github.com/justinchuby
Looks like all MPS operations will crash if one of the tensor dimensions is
greater than `2**31-1`.
Change it into a structured exception by checking the tensor size before
attempting to create the MPS tensor.
Add a regression test for it. Before this change, running the following would abort with an exception:
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort python3 -c
```
Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
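A minimal sketch of the intended post-change behavior on supported macOS versions: the oversized allocation raises a catchable RuntimeError instead of aborting the process.
```python
import torch

if torch.backends.mps.is_available():
    try:
        torch.randint(0, 10, (2**31,), dtype=torch.uint8, device="mps")
    except RuntimeError as e:
        # Structured error instead of a hard crash deep inside MPSNDArray.
        print("caught:", e)
```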
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
Main changes:
- bucketing collectives only from the same process_group by group_name
- Support for groups like [0,2,4,6], [0,1,3,5], using `rank_idx_dict` for in-pass operations such as slice indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632
Approved by: https://github.com/wconstab
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has its triton CUDA kernels. We also pre-save the autotune results in the precompile artifact.
It would be even better to pre-trim the CUDA kernels on save and apply them, which we can work on later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
Summary:
for this particular instance, we're doing
`from torch._inductor.config import trace`
`...trace.provenance_tracking...`
but for all other call sites, we're doing
`from torch._inductor import config`
`... config.trace.provenance_tracking....`
Test Plan:
CI
Rollback Plan:
Differential Revision: D78699876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158796
Approved by: https://github.com/c00w
Summary: We added group split in D78300794 and remote_group_merge in D78450094. We first want to upstream this change to PGNCCLx as well so that NCCLx can use this new API and we can continue our c10d clean up in https://github.com/pytorch/pytorch/pull/158488.
Test Plan:
CI
```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```
Rollback Plan:
Differential Revision: D78521060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158790
Approved by: https://github.com/d4l3k
Summary:
We transfer stack traces in post_grad passes.
We shouldn't add "stack_trace" to _COPY_META_FIELDS because _COPY_META_FIELDS is used in proxy.py, where stack_trace is explicitly set.
Since the stack_trace is being used by more and more debugging tools, we should also start testing it more rigorously. This PR starts by adding a first test that checks that the stack trace is preserved through post_grad_passes.
Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_pattern_matcher_transfer_meta
buck run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize -- --rcaffe2/test/inductor:auto_functionalize_old
```
Rollback Plan:
Differential Revision: D78669729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158752
Approved by: https://github.com/jingsh
Fix up decomposeK autotuning by removing the condition on returning more than `k_splits_limit` splits and setting the default to 10 instead of 5. Allow the user to configure `k_splits_limit` via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS`, and also allow the user to configure the threshold at which decompose_k is used via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what the reason was behind adding the warning but not erroring on it when building with -Werror, introduced by https://github.com/pytorch/pytorch/pull/140236)
300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that need updates:
- TensorPipe
- LLVM
- FBGEMM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
Summary:
**Changes**
* Deleted function import from build definition utilities
* Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
* Replaced is_arvr_mode() function calls with direct references to configuration flags
* Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
* Changed conditional expressions to Buck `select()` statements
Test Plan:
Check if CI passes
Rollback Plan:
Differential Revision: D78520947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
Fixes#147140
## Changes
- Add a `to` implementation in `MaskedTensor` to support moving `mask` to the target device
## Test Result
```python
In [1]: import torch
...: from torch.masked import as_masked_tensor
...: data = torch.tensor([1,2,3])
...: mask = torch.tensor([True,False,True])
...: mt = as_masked_tensor(data, mask).to('cuda')
...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))
In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)
```
```bash
pytest test/test_maskedtensor.py -vv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:
1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`.
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`
<details>
<summary>Quick debug print for sanity check</summary>
```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
- nelems_per_pe: 2
- dtype: torch.int64, element_size: 8 bytes
- nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
- Data for PE 0: [0, 0] (indices 0-1)
- Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
- Data for PE 0: [100, 100] (indices 0-1)
- Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
- From PE 0 (indices 0-1):
Expected: [0, 0]
Actual: [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
- From PE 0 (indices 0-1):
Expected: [1, 1]
Actual: [1, 1]
Match: ✓
Match: ✓
- From PE 1 (indices 2-3):
Expected: [100, 100]
- From PE 1 (indices 2-3):
Expected: [101, 101]
Actual: [100, 100]
Actual: [101, 101]
Match: ✓
Match: ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
- Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
- Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
- Each rank sends 2 elements to every other rank
- Rank 0 received: [0, 0, 100, 100]
- My data for PE 0 (0) went to PE 0's buffer
- I received PE 0's data for me (0)
- My data for PE 1 (1) went to PE 1's buffer
- Rank 1 sent: [100, 100, 101, 101]
- I received PE 1's data for me (100)
[Rank 0] ============================================================
- Rank 1 received: [1, 1, 101, 101]
- My data for PE 0 (100) went to PE 0's buffer
- I received PE 0's data for me (1)
- My data for PE 1 (101) went to PE 1's buffer
- I received PE 1's data for me (101)
[Rank 1] ============================================================
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
This PR attempts to cache:
* codegen for the cutlass backend for the same kernel, even if runtime params are different.
From some profiling, most of the time is spent in render, so we only target caching that part for now.
The output of render is `code`, and we are able to cache that easily. We also have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.
make_key does most of the work here: we hash on the input node layouts, the output node layout, and op.configuration_name() (this is what hash(op) would do anyway).
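A generic sketch (names are hypothetical, not the actual inductor implementation) of the keyed memoization described above: build a key from the input/output layouts and the op's configuration name, and reuse the rendered code plus size_args.
```python
_render_cache = {}

def make_key(input_layouts, output_layout, config_name):
    return (tuple(map(str, input_layouts)), str(output_layout), config_name)

def render_with_cache(kernel, op, input_layouts, output_layout):
    key = make_key(input_layouts, output_layout, op.configuration_name())
    if key not in _render_cache:
        code = kernel.render(op)                     # the expensive step being cached
        size_args = kernel.get_dynamic_shape_args()  # depends on render-time state
        _render_cache[key] = (code, size_args)
    return _render_cache[key]
```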
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
Summary: This test often fails internally -- it looks like autotuning sometimes chooses not to do the epilogue tuning. Turning off `benchmark_epilogue_fusion` seems to fix it.
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
torchvision's .so file lacks some symbol definitions because these symbols come from CUDA, but the current environment does not have CUDA or a GPU. The above error message is very confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has its triton CUDA kernels. We also pre-save the autotune results in the precompile artifact.
It would be even better to pre-trim the CUDA kernels on save and apply them, which we can work on later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
Fixes https://github.com/pytorch/pytorch/issues/158382
```
renamed: torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed: torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed: torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```
Everything else is ONLY import changes. I did not rename any functions,
even though we probably should have.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
Change the default value of min_chunk_size from 4096 to 512 to allow more `for` loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
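A small usage sketch, assuming the knob stays exposed as the existing `cpp.min_chunk_size` inductor config option:
```python
import torch._inductor.config as inductor_config

# Lower the minimum chunk size so smaller loops become eligible for parallelization.
inductor_config.cpp.min_chunk_size = 512
```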
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although functionally one is never supposed to hit this codepath, it's still better not to emit the warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.
We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`
Also tested that if you remove the barrier, you get an assertion error/race conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining` a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
Summary: export_for_training exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we should deprecate and delete this API.
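A tiny sketch of the replacement path suggested by the summary above: call `torch.export.export` directly now that the migration to the final IR is complete.
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Plain torch.export.export instead of the deprecated export_for_training.
ep = torch.export.export(M(), (torch.randn(2),))
print(ep)
```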
Test Plan:
CI
Rollback Plan:
Differential Revision: D78240836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing a torch build with CUDA 12.8/12.9 while running on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Please note I reverted the original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland, with an added check for a non-empty torch.cuda.get_arch_list().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.
In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:
- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
refactoring of the `getDLDevice()` function at
_aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
_torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
method.
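A hedged sketch, assuming the keyword names land on `Tensor.__dlpack__()` exactly as described above; it asks the producer for a copy rather than shared memory when exporting through the DLPack protocol.
```python
import torch
from torch.utils.dlpack import from_dlpack

x = torch.arange(4)
capsule = x.__dlpack__(copy=True)    # request a copy instead of aliasing x
y = from_dlpack(capsule)
print(y.data_ptr() != x.data_ptr())  # expected: True, since a copy was made
```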
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-matching+replacement would just be skipped, but now these are asserts that fail and raise.
This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.
This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
By setting the names of the domain libraries to build via the `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there. Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leaves the GPU resources idle.
Fixes https://github.com/pytorch/pytorch/issues/152024
Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input: Shard(0), output: Replicate() for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```
Test Plan:
See D78520305
Rollback Plan:
Differential Revision: D78517953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.
So we could whack-a-mole all the warning muters in compile to just not display warnings, and we wouldn't invalidate the warnings cache. This PR adds it for torch/_dynamo, and I didn't find any warnings-versioning mutation in torch/_inductor.
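A minimal sketch (not dynamo's actual implementation) of that hook-based approach: temporarily swap out `warnings.showwarning`, which only controls display and therefore leaves filters, `__warningregistry__`, and the warnings version untouched.
```python
import contextlib
import warnings

@contextlib.contextmanager
def mute_warning_display():
    prev = warnings.showwarning
    warnings.showwarning = lambda *args, **kwargs: None  # drop the message
    try:
        yield
    finally:
        warnings.showwarning = prev

with mute_warning_display():
    warnings.warn("hidden from the console; filters are not modified")
```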
There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
warnings.simplefilter("error") # turns all warnings into errors
compiled_fn() # will throw if any of the muted warnings fire
```
FIXES https://github.com/pytorch/pytorch/issues/128427
A note for the future: The warnings module doesn't offer a thread safe way of using it. Even regular filters have this problem, directly editing `__warningregistry__` would be very bad, and this PR would mute all threads. Someone will need to build a thread safe warnings interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Summary: This allows us to start alerting on cache failures, based on scuba data
Test Plan:
Added new tests explicitly for the Remote Cache API.
Note that we have existing tests for memcache, but not for manifold AFAICT.
There are two potential wrinkles. One we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).
The other one is the implicit api contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I think won't cause any breakages.
Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243
Rollback Plan:
Differential Revision: D77054339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.
Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.
Rollback Plan:
Reviewed By: malfet
Differential Revision: D78458680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.
### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
- Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated Lintrunner Environment**: During pre-push, Lintrunner runs in its own virtual environment that contains all lintrunner dependencies, giving a consistent, isolated environment. No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified
### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.
### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
- Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
- You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
- This runs `git stash` before running the pre-commit hooks and pops your changes back afterwards, so only the changes actually being pushed will be tested
### Downsides
- No streaming UI updates
- While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
- This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.
### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
- Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
- You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.
### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
- Our `CONTRIBUTING.md` file tells devs to run `make setup-env`. That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
Fixes#156587
This sets lower bounds for fsspec and networkx in both setup.py and requirements.txt.
- fsspec >= 0.8.5 (released December 15, 2020)
- networkx >= 2.5.1 (released April 3, 2021)
These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.
Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow-up later if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
Summary: These tests fail internally because the number of underlying calls to the rng differs by virtue of various library initializations that get sucked in with an internal build.
Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo.
This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`
Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```
|  | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
Summary: For cases like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff adds this behavior when the global flag is toggled on, so that ID_MATCH guards are excluded from compilation and serialization.
Test Plan:
test_dynamo -- -k test_id_match_with_config
Rollback Plan:
Differential Revision: D78363609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
The architecture version checks are unnecessarily fine-grained in PyTorch. Considering the fact that PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check for a lower bound.
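A small sketch of the same "check a lower bound" idea expressed in Python (the actual change is in the C++ checks): gate Flash-Attention-style paths on sm_80 or newer rather than an allow-list of exact architectures.
```python
import torch

def supports_flash_attention(device_index: int = 0) -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major >= 8  # sm_80 and newer

print(supports_flash_attention())
```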
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with inductor, where we decompose them into while_loop and use the idx tensor's .item() to select a slice of the output buffer and write into it. This PR turns the flag on by default (see the sketch below).
2. A graph break caused by capture_scalar_outputs=False would cause the HOP to fail, and we should turn the flag on by default so that the error message is more meaningful.
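A small sketch of what the flag enables: a data-dependent `.item()` call becomes an unbacked symint instead of forcing a graph break (or an error under `fullgraph=True`).
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x):
    n = (x > 0).sum().item()  # captured as an unbacked symint
    return torch.zeros(n)

print(f(torch.randn(8)).shape)
```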
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
Address the second part of #158366, where torch.tensor(0) is treated as a constant tensor and its .item() gets specialized to 0, which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input: Shard(0), output: Replicate() for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and ` create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace ` (such as `linear`-> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
- A post grad graph node might not have stack trace if it correspond to multiple pre grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))
Example:
```
scheduling ExternKernelOut(
python_kernel_name='extern_kernels.mm',
name=buf0,
layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
StorageBox(
ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
),
FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)],
constant_args=(),
kwargs={},
output_view=None,
python_kernel_name=extern_kernels.mm,
cpp_kernel_name=at::mm_out,
ordered_kwargs_for_cpp_kernel=(),
op_overload=None,
arg_properties=[{}, {}],
allarg_properties={},
kwarg_properties=None,
unbacked_bindings={},
mutation_outputs=[],
origin_node=mm_default_1,
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)
```
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D78365534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
Include both the error stacktrace and the graphmodule in a new
structured trace artifact. Log the shortened version to the console,
and also log a hint to look at the tlparse for more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package Version
------------------ -----------
attrs 25.3.0
build 1.2.2.post1
certifi 2025.7.14
charset-normalizer 3.4.2
cmake 4.0.3
exceptiongroup 1.3.0
expecttest 0.3.0
filelock 3.18.0
fsspec 2025.5.1
hypothesis 6.135.32
idna 3.10
importlib_metadata 8.7.0
Jinja2 3.1.6
lintrunner 0.12.7
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.4
opt-einsum 3.3.0
optree 0.16.0
packaging 25.0
pip 25.1
psutil 7.0.0
pyproject_hooks 1.2.0
python-dateutil 2.9.0.post0
pytz 2025.2
PyYAML 6.0.2
requests 2.32.4
setuptools 78.1.1
six 1.17.0
sortedcontainers 2.4.0
sympy 1.14.0
tomli 2.2.1
typing_extensions 4.14.0
tzdata 2025.2
urllib3 2.5.0
uv 0.7.21
wheel 0.45.1
zipp 3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.22.4
Uninstalling numpy-1.22.4:
Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun 5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
from pandas.compat.numpy import (
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
from pandas.util.version import Version
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
from pandas.util._decorators import ( # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
from pandas._libs.properties import cache_readonly
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
from pandas._libs.interval import Interval
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`
- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo
This PR adds strict typing support to an important file in dynamo, `decorators.py`
NOTE: Some fns are left untyped because there is a conflict with `__init__.py` in compiler, so we can't type these at this time
Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```
| | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta | +661 | +35 | +76.98% | +27 | 0 | +76.92% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
Summary:
The shims for aten ops are now generated by torchgen, but there are still some old APIs in `aoti_torch/c/shim.h`.
This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`
The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`
Test Plan:
CI
Rollback Plan:
Differential Revision: D78378373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, running pytest with `test_torchinductor_opinfo.py test_other_module.py` in the same process may affect the test behaviour of `test_other_module.py` if the tests depend on `wrapper_set_seed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.
This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.
This also leaves the default behavior as a fallback for when
you're not running on a pull request.
Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.
This also splits mypy into its own job since mypy needs to run on all-files all the time.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
Summary: This test is failing internally because the number of underlying calls to the rng differs by virtue of various library initializations that get sucked in with an internal build.
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
Summary: I tried to add some logic to optimize the non-row-wise sharded case and handle it more efficiently, but it has some bugs, so I am removing it for now and will find a better algorithm for the non-row-wise sharded case to determine the maximum number of bytes that we can write at a time.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D78366701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.
Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.
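A hedged sketch of that capability check (illustrative; the actual Inductor code may differ):
```python
import inspect

def has_warp_spec() -> bool:
    # Assumption: warp spec is considered available only if triton.Config still
    # accepts the explicit num_consumer_groups / num_buffers_warp_spec kwargs.
    try:
        import triton
    except ImportError:
        return False
    params = inspect.signature(triton.Config.__init__).parameters
    return (
        "num_consumer_groups" in params
        and "num_buffers_warp_spec" in params
    )
```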
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
### Description
This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
Related kernel and UT updates are done. The wrapper `bf32_on_and_off` is updated to `reduced_f32_on_and_off`; it can run tests 3 times: once with reduced_f32 OFF and twice with reduced_f32 ON (i.e. `bf32` ON and `tf32` ON).
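As a hedged usage sketch of the new knobs (assuming an x86 CPU where oneDNN can use TF32 internally; inputs and outputs stay fp32):
```python
import torch

# Only the internal compute precision of the mkldnn (oneDNN) matmul/conv
# kernels is relaxed to TF32; tensor dtypes remain torch.float32.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"
torch.backends.mkldnn.conv.fp32_precision = "tf32"

linear = torch.nn.Linear(256, 256)
x = torch.randn(32, 256)
y = linear(x)
print(y.dtype)  # torch.float32
```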
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
Summary:
In PyTorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0").
tensor.to("cuda") will read from thread local DeviceGuard, aka cuda::current_device(), to infer the device index.
TBEPermute relies on this behavior to route the output tensor to a device specified by the current thread.
For this reason, we remove normalizeDevice() and disallow index-less cuda devices in Placement.
Device-to-device mapping must be done between concrete devices!
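A small illustration of the difference described above (requires at least two GPUs; this is not the TBEPermute code itself):
```python
import torch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    t = torch.randn(4)
    with torch.cuda.device(1):
        # "cuda" resolves through the thread-local current device -> cuda:1
        print(t.to("cuda").device)    # cuda:1
        # "cuda:0" carries an explicit index and ignores the device guard
        print(t.to("cuda:0").device)  # cuda:0
```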
Test Plan:
CI
Rollback Plan:
Differential Revision: D78443109
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
Summary: AOTI already has weights embedded in the .so file, so for the initial load there is no need to load the weights again. This allows lowered modules to have different sets of weights on different hardware.
Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice "" --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```
Rollback Plan:
Differential Revision: D78383881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
Rocm and XPU builds since they use Miniforge are not affected
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the solution is either:
1. Use ``conda tos accept --override-channels --channel defaults``
2. Use Miniforge instead of Miniconda.
We are using solution 2.
Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true``
2. Using an older version of miniconda:
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR surfaces some mypy errors, but I believe those were in the code base beforehand and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
When select has a data-dependent input, we can't tell whether the actual index should be index + size or index.
To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the
output view and compute its value dynamically at runtime when the graph is lowered in Inductor.
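For context, the ambiguity is ordinary negative-index semantics; a tiny eager-mode illustration (not the Inductor change itself):
```python
import torch

x = torch.arange(12).reshape(4, 3)

def select_row_offset(x: torch.Tensor, i: int) -> int:
    # The effective row (and hence the view's storage offset) depends on the
    # sign of i, which is unknown at compile time for a data-dependent index.
    return i if i >= 0 else i + x.size(0)

print(select_row_offset(x, 1))   # 1
print(select_row_offset(x, -1))  # 3
```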
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
When I modified the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated lint errors occurred.
```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:
Error (CODESPELL) spelling error
Failed due to ValueError:
/pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
==> different
Please either fix the error or add the word(s) to the dictionary file.
HINT: all-lowercase words in the dictionary can cover all case variations.
Lint for torch/fx/traceback.py:
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "str", variable has
type "None")
101 |
102 | def _get_action_string(self):
103 | if self._action_string is None:
104 | self._action_string = "+".join([a.name.lower() for a in self.action])
105 | return self._action_string
106 |
107 | def print_readable(self, indent=0):
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "dict[str, Any]",
variable has type "None")
121 | if self._dict is None:
122 | # Convert the object to a dictionary
123 | action_string = self._get_action_string()
124 | self._dict = {
125 | "name": self.name,
126 | "target": self.target,
127 | "graph_id": self.graph_id,
Error (MYPY) [return-value]
Incompatible return value type (got "None", expected "dict[Any, Any]")
130 | "from_node": [node.to_dict() for node in self.from_node],
131 | }
132 |
133 | return self._dict
134 |
135 | def __eq__(self, other: object):
136 | if not isinstance(other, NodeSource):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.
This helps existing backends migrate to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.
Example stack trace:
```
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
self.handler(match, *match.args, **match.kwargs)
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```
This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15
I've attached a test which reproduces this issue.
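A minimal sketch of the kind of normalization the fix needs (names here are illustrative, not the actual split_cat pass code):
```python
import torch.fx as fx

def normalize_cat_inputs(cat_inputs):
    # aten::cat can arrive with a single fx.Node instead of a list when there
    # is only one input; normalize so len() and iteration are always valid.
    if isinstance(cat_inputs, fx.Node):
        return [cat_inputs]
    return list(cat_inputs)
```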
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`
- change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.
In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
See more motivation in internal Diff
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D78287976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
These changes add an `eval()` API to PP schedules
## Context
Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.
However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.
```python
if self.rank == 0:
schedule.eval(x)
elif self.rank == self.world_size - 1:
losses = []
schedule.eval(target=target, losses=losses)
else:
schedule.eval()
```
TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing a torch build with CUDA 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html
Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed the printed value is treated as a format string and may contain special characters in some cases. While this is not true in our case, it can't be determined at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
Fixes #71673
This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False))) # (2x2 boolean mask)
>>> x[m].shape # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])
>>> x[m, 0] # this should produce a tensor of shape (1,)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1
>>> x[m, ::2] # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1
>>> x[m, None] # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
Fixes the following issue when building PyTorch with ROCm 7.0:
```
-- verifying file...
file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
does not match expected value
expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
Each download failed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
`libnvshmem_extension.so` creates an illusion that it is a shared library from NVSHMEM. But indeed it is built from torch source code, for symmetric tensor infrastructure and operations, though leveraging NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
As popularly requested in user groups.
Test plan:
```
import torch
a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```
Before:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
After:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@fb.com>
Fixes #141563
In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch an empty ellipsis doesn't cause a separation. This leads to differing behavior between Numpy and PyTorch in this edge case.
This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```
Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.
Notice that there are still some other bugs in PyTorch's advanced indexing, that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
torchvision's .so file lacks some symbol definitions because these symbols come from CUDA, but the current environment does not have CUDA or a GPU. The above error message is very confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-infer device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.
I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy should be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.
I will leave it to pytorch developers to see if the improvement is worthwhile.
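A hedged sketch of the intended zero-copy path (assumes CuPy is installed and a CUDA device is available; whether a copy happens in corner cases is exactly what this PR adjusts):
```python
import torch
import cupy as cp

with cp.cuda.Device(0):
    a = cp.arange(10, dtype=cp.float32)

# With an explicit device matching the array's device, this should stay
# zero-copy via __cuda_array_interface__ rather than trigger a D2D copy.
t = torch.as_tensor(a, device="cuda:0")
t[0] = 42.0
print(a[0])  # 42.0 if the storage is shared
```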
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.
Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Hi team,
Please help review this trivial fix.
Without this change:
``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
With this change:
```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events.
If this list contains values Kineto runs in CUPTI profiler mode
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
verbose (bool) : whether the trace file has `Call stack` field or not.
performance_events : a list of profiler events to be used for measurement.
enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
that expose CUDA device, stream and event synchronization activities. This feature is new
and currently disabled by default.
adjust_profiler_step (bool) : whether to adjust the profiler step to
match the parent python event duration. This feature is new and currently disabled by default.
disable_external_correlation (bool) : whether to disable external correlation
profile_all_threads (bool) : whether to profile all threads
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it. Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand and has led to a bit of duplicate code for the export path.
So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
* create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
* One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
* Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
* Do some base/export/autograd specific post-processing on the graph
Notice that the pattern of nested function invocations means there is no way to easily get the graph capture result from the autograd case; furthermore, the export path is "bolted" on to force the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.
Here is the new structure:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
* Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
* Call create_aot_state to do initial setup and setup all the context managers on stack. These context managers do NOT exit upon return of this.
* Call aot_stage1_graph_capture to do the graph capture
* Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want
With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
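A hedged sketch of the new top-level orchestration (the stage names follow the description above, but the bodies are stand-in stubs and the signatures are assumptions):
```python
import contextlib

# Stand-in stubs for the real stages; real signatures and state differ.
def create_aot_state(flat_fn, flat_args, aot_config, stack):
    stack.enter_context(contextlib.nullcontext("ambient tracing contexts"))
    return {"fn": flat_fn, "args": flat_args, "config": aot_config}

def aot_stage1_graph_capture(aot_state):
    return {"joint_graph": f"captured graph of {aot_state['fn'].__name__}"}

def aot_stage2_compile(aot_state, captured):
    return lambda *args: aot_state["fn"](*args)  # pretend-compiled callable

def aot_compile_pipeline(flat_fn, flat_args, aot_config=None):
    # One ExitStack scopes the whole pipeline; create_aot_state pushes its
    # context managers onto it so they survive into later stages.
    with contextlib.ExitStack() as stack:
        aot_state = create_aot_state(flat_fn, flat_args, aot_config, stack)
        captured = aot_stage1_graph_capture(aot_state)
        # A caller could inspect or rewrite the captured joint graph here
        # before resuming compilation.
        return aot_stage2_compile(aot_state, captured)

def add_one(x):
    return x + 1

compiled = aot_compile_pipeline(add_one, (1,))
print(compiled(41))  # 42
```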
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
Two main things of note:
- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
stages, I am using the ExitStack trick: there is an ExitStack which is
in scope for the entire pipeline, and inside of the individual
pipeline stages we push context managers onto this stack when we want
them to survive into the next pipeline stage. This is not obviously
what the best final form of the code is, but
create_aot_dispatcher_function is called from multiple locations so I
can't just inline the context managers into the call site.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
Summary: NodeSource should not be updated after it is created, so we can cache its dict and string representation for better perf.
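A hedged illustration of the caching pattern (generic Python, not the actual NodeSource implementation):
```python
from functools import cached_property

class ImmutableNodeInfo:
    """Not mutated after construction, so derived representations can be cached."""

    def __init__(self, name: str, target: str, graph_id: int):
        self.name, self.target, self.graph_id = name, target, graph_id

    @cached_property
    def as_dict(self) -> dict:
        # Computed once on first access, then reused.
        return {"name": self.name, "target": self.target, "graph_id": self.graph_id}

    @cached_property
    def as_str(self) -> str:
        return f"{self.name} -> {self.target} (graph {self.graph_id})"

info = ImmutableNodeInfo("relu_1", "aten.relu.default", 0)
print(info.as_dict)
print(info.as_str)
```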
Test Plan:
ci
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D78298501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
**Problem:**
Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient, as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads would be accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size of_ buffers that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed, and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.
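A hedged usage sketch of the knobs mentioned above (`realize_acc_reads_size_threshold` is the new config; the unit and value below are illustrative assumptions):
```python
import torch._inductor.config as inductor_config

# Existing knob: cap the *number* of accumulated read buffers per operator.
inductor_config.realize_acc_reads_threshold = 8
# New knob from this PR: additionally cap the accumulated read *size*.
# Assumed unit/value here are illustrative only.
inductor_config.realize_acc_reads_size_threshold = 4 * 1024**2
```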
**Results:**
For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Before the PR, for code like this:
```
class Example2(torch.nn.Module):
    def forward(self, x, trigger, target):
        return torch.cond(
            trigger == 1,
            lambda: x + target,
            lambda: x * target,
            (),
        )

m = Example2()
x = torch.randn(2)
trigger = 0
target = 2
args = (x, trigger, target)
ep = torch.export.export(
    m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
)
```
dynamo will wrap "target" (i.e. a symint) twice: once when we speculate the first lambda, find that target is a symint, and decide to wrap it, creating a new SymNodeVariable and a placeholder input to the top-level graph.
The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure objects with the same id (though different sources) are mapped to the same TensorVariable. For symints, two things are missing:
1. it's not in the _can_lift_attrs_to_input list (the change in builder.py)
2. it's not tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)
Note: the auto lifting mechanism for HOPs happens at the proxy level when we trace the subgraph, which is after the SymNodeVariables are created (they're created when realizing the args and binding them to the subgraph). At that time, the builder has created two unique SymNodeVariables for the same symint, so the auto lifting in HOPs cannot de-dup them.
Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
Summary:
Adds a unit test to verify that when 'user_managed=True' is passed to 'update_constant_buffer', the compiled AOTI model properly shares parameter storage with the eager model.
The test specifically covers the following:
1. Passes model weights to the AOTI model with 'user_managed=True'.
2. Updates the eager model weights using 'load_state_dict()', which performs an in-place update.
3. Asserts that the compiled AOTI model reflects the updated weights, confirming shared memory behavior.
Fixes: #157474
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157496
Approved by: https://github.com/desertfire
Use ```brew install --cask miniconda``` as specified by https://formulae.brew.sh/cask/miniconda
Forward fix After: https://github.com/pytorch/pytorch/pull/156898#issuecomment-3074207175
Seeing in CI:
```
Run if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
==> Caveats
Please run the following to setup your shell:
conda init "$(basename "${SHELL}")"
Alternatively, manually add the following to your shell init:
eval "$(conda "shell.$(basename "${SHELL}")" hook)"
==> Downloading https://repo.anaconda.com/miniconda/Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
Already downloaded: /Users/ec2-user/Library/Caches/Homebrew/downloads/2e356e8b147647692e4da77ce4c0c14eefee65ec86f29cc7e8c21a26ac9397ca--Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
==> Installing Cask miniconda
==> Running installer script 'Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh'
PREFIX=/opt/homebrew/Caskroom/miniconda/base
Unpacking payload ...
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
Installing base environment...
Preparing transaction: ...working... done
Executing transaction: ...working...
done
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
installation finished.
==> Linking Binary 'conda' to '/opt/homebrew/bin/conda'
🍺 miniconda was successfully installed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158347
Approved by: https://github.com/seemethere
Summary: We have internal test failures for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```
To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
This PR changes this to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
Summary: As above, also changes a bunch of the build files to be better
Test Plan:
internal and external CI
did run buck2 build fbcode//caffe2:torch and it succeeded
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
The general context for the upcoming stack of commits is I am attempting
to "pipeline" AOTAutograd. Instead of having function f call function g
which is the next "stage" of compilation, instead f should return with
its outputs, which are then piped to g for the next stage. This will
make it easier to implement early exit / resume pipeline without forcing
callback structure, which is good for export-style use cases. It also
reduces the size of our stack traces, which makes tools like Perfetto
happy.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
This PR disables `strict-aliasing` GCC C++ optimization flag on all AArch64 cpus for GCC versions 12 and above.
Pull Request #152825 upgraded the gcc version from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy gcc version has not been updated yet).
We identified that the problem also exists in GCC 12, hence the `__GNUC__ >= 12` check.
Fixes #157626
This fixes these test failures when PyTorch is built with GCC 12 and above:
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
Fixes #124435
This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. There was a previous PR addressing this but it was closed due to inactivity. This picks that up and applies the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time.
Differential Revision: D77915987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
----
# Refactor and Improve the OpenReg Module
## Background
Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following:
- https://github.com/pytorch/pytorch/issues/155864
- https://github.com/pytorch/pytorch/issues/144955
- https://github.com/pytorch/pytorch/issues/144845
Taking these requests into consideration and combining them with the position of OpenReg, which is currently used as the test backend for PrivateUse1, I'm planning to make the following optimizations:
- Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code.
- Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual.
## Design Principles:
- Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference.
- Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch.
## More Infos:
Please refer to [this](6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md) for more information about `OpenReg`.
## Current Progress:
- Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand.
- Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring.
## Next Steps:
- Add more features to cover all integration points.
- Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090
Approved by: https://github.com/seemethere, https://github.com/albanD
The `test_triton_wait_until` test was hanging due to an NCCL synchronization issue stemming from mismatched NVSHMEM operations. Specifically, the flag variable was updated using `nvshmemx_signal_op` (a signaling operation), but waited on with `nvshmem_wait_until` (intended for put/get updates). Per NVSHMEM documentation (see documentation reference section below), signal-updated variables require `nvshmem_signal_wait_until` for proper completion guarantees, so the mismatch caused a deadlock and NCCL hang.
**Fix:**
- A simple fix was to replace the flag update with a regular `nvshmem_putmem_block` (via `put_kernel`) to match `nvshmem_wait_until`. I also added a fence (`nvshmem_fence`) between data and flag puts on the sender (Rank 1) for ordered delivery.
- In a follow-up PR I will add a kernel/test to demonstrate usage of `nvshmemx_signal_op`
**Testing:**
- I ran `python test/distributed/test_nvshmem_triton.py` and `python test/distributed/test_nvshmem_triton.py -k test_triton_wait_until`
- I also verified with debug prints (Sender completes puts/fence before receiver's wait returns, and assertions confirm correct state). Multiple runs show no hangs or failures.
**Documentation Referenced:**
- [NVSHMEM Point-To-Point Synchronization](https://docs.nvidia.com/nvshmem/api/gen/api/sync.html) explicitly states: *"the sig_addr object at the calling PE is expected only to be updated as a signal, through the signaling operations available in Section NVSHMEM_PUT_SIGNAL and Section NVSHMEM_PUT_SIGNAL_NBI"*
- [NVIDIA's Official Ring Broadcast Example](https://docs.nvidia.com/nvshmem/api/examples.html) demonstrates the correct pairing: `nvshmemx_signal_op` with `nvshmem_signal_wait_until` (not `nvshmem_wait_until`)
- [NVSHMEM Signaling Operations](https://docs.nvidia.com/nvshmem/api/gen/api/signal.html) documents that signal operations work on special "signal data objects" with specific atomicity guarantees distinct from regular RMA operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158167
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
Beginning of process for 3.14 bringup.
State of things from this PR:
- Nothing too scary looking from the Dynamo CPython side, nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() nicely fail is working as expected. So all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious, we should investigate what is the reason and impact of that, in particular for our public API checking @jbschlosser
- Leaving the weakref.py thread safety change as a follow up to keep this a bit simpler. I vendored the whole struct in the meantime FYI @ezyang
EDIT: The `__module__` change is even more cursed than I thought due to changes to the Union and Optional types, where the `__module__` field cannot be changed anymore. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting for 3.14 which will trip the public API checks. Will revisit once I have a final answer on the cpython issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.
However, `TupleStrategy` is not a good fit for the `split` op. `TupleStrategy` was
initially introduced to handle the sharding strategy of `foreach_*` ops where
the input args can be split into independent subsets regarding sharding decisions,
so are the outputs.
To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns made a similar
assumption, so I think it's fine to unblock it in this way.
Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.
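A usage-level sketch of the op this affects (assumes a 2-rank job launched with `torchrun`; `torch.split` on a DTensor returns a `List[DTensor]`, whose sharding is now propagated via `OpStrategy`):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# Assumes torchrun with 2 ranks and one GPU per rank.
mesh = init_device_mesh("cuda", (2,))
dt = distribute_tensor(torch.randn(8, 6), mesh, [Shard(0)])
chunks = torch.split(dt, 3, dim=1)  # List[DTensor]; split dim != shard dim, so no resharding
```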
**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158051
Approved by: https://github.com/wconstab, https://github.com/zpcore
local_tensor input to grouped_mm has a stride requirement.
(see `_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp)
Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158245
Approved by: https://github.com/zpcore, https://github.com/XilunWu
This PR allows for symints in `gen_slice_strategy` which is the strategy for `aten.slice.Tensor`. Previously, using dynamic shapes with slicing would result in
```
File ".../pytorch/torch/distributed/tensor/_ops/_tensor_ops.py", line 348, in gen_slice_strategy
assert isinstance(end, int)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(DTensor(local_tensor=FakeTensor(..., device='cuda:0', size=(s3, 2)), device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Shard(dim=0),)), slice(None, (s77//2), None)), **{}): got AssertionError()
```
Questions before merge:
1. `dim` is still asserted to be int. Is this fine, or is this potentially dynamic as well?
2. I'm using argtype ignore for `normalize_dim`. Should I instead change types for `normalize_dim` and further dependency to be `IntLike` as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157953
Approved by: https://github.com/wconstab
When loading a package and calling package.install(backends), we create a new frame and compile id for each package load, so that tlparse and chromium events still show compile times on warm start.
There is an argument for not doing this in AOT precompile, as no "compile" occurs. So for now, we put it in `package.install`, which hopefully won't be a thing for AOT precompile.
## Recompiles
Recompiles get saved to the same frame and code entry, so on warm start, each recompile will get collapsed into the same entry. Therefore, dynamo compiles that have recompiles on cold start (0/0, 0/1, 0/2, etc) will all get collapsed into a single compile id (0/0), as warm start will load all of the entries properly.
## Graph breaks
Graph breaks get their own compile id, and therefore their own code entry. These are replicated on warm start, so if cold start you had 4 different graphs (and therefore 4 compile ids), you'll have 4 compile ids on warm start as well.
## Test plan
Added a frame counter check to existing unit tests for automatic dynamic, showing that the frame counter is the same between the old and new load.
This is the chromium event for test_automatic_dynamo_graph_breaks_device_cuda:
```
python test/dynamo/test_package.py -k test_automatic_dynamo_graph_breaks_device_cuda
```
<img width="2216" height="508" alt="image" src="https://github.com/user-attachments/assets/f604ed33-5c31-464b-9320-d67b2e6f57a1" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158028
Approved by: https://github.com/oulgen
This is intended to make it easier to have backend specific "hints" that can be provided by the user to hint about certain options.
```py
import torch.distributed._dist2 as dist2
pg = dist2.new_group(backend="my_custom_backend", device=..., timeout=..., foo=1234, bar="1234")
pg.allreduce(...)
```
Test plan:
```
pytest test/distributed/test_dist2.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158147
Approved by: https://github.com/fduwjj
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet:
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads would be accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size of_ the buffers that can be accumulated (a short sketch of setting both knobs follows this list).
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.
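A minimal sketch of how these two knobs might be tuned together (the config names are the ones described above; the size value and its unit below are illustrative assumptions):
```python
import torch
import torch._inductor.config as inductor_config

# Existing knob: max number of accumulated reads per operator.
inductor_config.realize_acc_reads_threshold = 8
# New knob from this PR: cap on the accumulated size of the reads
# (value/unit here are assumptions for illustration).
inductor_config.realize_acc_reads_size_threshold = 4 * 1024 ** 2

@torch.compile
def accumulate(total, xs):
    for x in xs:
        total = total + x
    return total
```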
**Results:**
For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
While enabling inductor CI on Windows, `test_torchinductor_opinfo.py` took too much time (about 12 hours), seriously exceeding the CI time limit. After analysis, compiler building was about 4x slower on Windows than on Linux.
Thus, we decided to skip the UT temporarily, and @xuhancn will keep looking for a solution to speed up compiler building on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158225
Approved by: https://github.com/jansel
Co-authored-by: Xu Han <xu.han@outlook.com>
Fixes#156707
Detect if all values along the softmax axis are infs and overwrite the outputs for those computations with zeros before the final matmul. The behavior should be aligned with the CPU implementation.
According to the original issue, cases where all values along a dimension of the attention mask are false (leading to undefined softmax outputs) occur with left-padded batches during generation in HF transformers.
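A minimal numeric illustration of the corner case (plain softmax, no attention kernel involved): a fully masked row becomes all `-inf` scores, whose softmax is undefined, which is what the kernel now overwrites with zeros before the final matmul.
```python
import torch

scores = torch.full((2, 3), float("-inf"))
scores[1, 0] = 0.0  # second row has at least one unmasked position
print(torch.softmax(scores, dim=-1))
# row 0 (fully masked): [nan, nan, nan]; row 1: a well-defined distribution
```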
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157727
Approved by: https://github.com/malfet
Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:
```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```
This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q
```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```
In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the lengths of one of the features.
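A hypothetical repro sketch of the check in question (assumes a CUDA device; `output_size` must equal `repeats.sum()`):
```python
import torch

x = torch.arange(3, device="cuda")
repeats = torch.tensor([2, 3, 1], device="cuda")
ok = torch.repeat_interleave(x, repeats, dim=0, output_size=6)   # 2 + 3 + 1 == 6
# torch.repeat_interleave(x, repeats, dim=0, output_size=5)      # mismatch -> device-side assert
```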
## What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too.
Test Plan:
* Built an ephemeral package with my changes:
https://www.internalfb.com/intern/servicelab/build/736441058/
* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg
* I will export to GH and run CI/CD tests.
Rollback Plan:
steps:
- manual.note:
content: >-
Just reverting this diff should be sufficient. Since this change is in
CUDA kernels, I do not believe there is a way to change the error
message via a JK.
Reviewed By: mradmila
Differential Revision: D77904753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157996
Approved by: https://github.com/ngimel, https://github.com/eqy
This PR adds a pass to sanitize_gm_for_cache which normalizes all placeholder names across input dynamo graphs to AOTAutogradCache. This is safe because nothing underneath AOTAutograd uses the node names on the
original dynamo graph: AOTAutograd re-traces with its own nodes, and guards are
in terms of original sources rather than placeholder names.
Note that the dynamo output graphs traced by tlparse will not show this change because it's done before this sanitization step. The aot autograd outputs also will not change because AOTAutograd's own traced graphs don't use the original placeholders of the dynamo graph. Thus, this change is essentially a no-op from everyone's perspective except for cache key checks.
Fixes#157792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157916
Approved by: https://github.com/zou3519
Before, if NVSHMEM is installed at *BOTH* a system location (e.g. `/usr/local`) and a conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`), there can be a mismatch in where the host lib and device lib are found:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
The reason is that CMake prioritizes name search over directory search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES nvshmem_host nvshmem_host.so.3
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```
This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.
After this PR:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
Re-raising of #129959 as that was closed.
Warning message before:
```
/home/admin/.local/share/hatch/env/virtual/toms-project-1/Qv9k_r_5/dev/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
```
Warning message after:
```
/path/to/my/code:91: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
```
Helps the user find where the issue stems from in their code. What do you think?
(Looks like "skip_file_prefixes" is not available until Python 3.12 minimum...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155112
Approved by: https://github.com/Skylion007, https://github.com/cyyever
This fixes `index_put(..., accumulate=True)` for all dtypes.
The int64 operation is not truly atomic, but it is eventually consistent from the `index_put_accumulate` kernel's point of view: i.e., by the end of the operation, the results in global memory are indeed the accumulation of the operands at the given indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
Fixes#156012
This is a temporary solution that makes context parallelism work until the logsumexp behavior changes land in AOTriton.
After discussion, we are not going to release AOTriton 0.10.1 to fix this because:
* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily, https://github.com/XilunWu
Fixes#157720
### What's in this PR?
This PR improves the error handling in `torch.compile` for `ndarray.astype('O')` (or `object`). It now explicitly raises a `torch._dynamo.exc.Unsupported` exception with a clear explanation, instead of failing with a less intuitive error during fake tensor propagation.
This is achieved by adding a check within `NumpyNdarrayVariable.call_method` for this specific `astype` pattern.
A new test, `test_ndarray_astype_object_graph_break`, is also added to `test/test_numpy_interop.py` to verify this new behavior.
### Background
Previously, attempting to `torch.compile` a function containing `ndarray.astype('O')` would result in a `TorchRuntimeError` wrapping a `TypeError: data type 'O' not understood`. This error message, originating deep within the tensor mechanism, was not very user-friendly and didn't clearly state *why* it was unsupported.
This change makes the failure more explicit and provides a better user experience by giving a direct, actionable error message.
**Old Behavior (Error Traceback):**
```
torch.dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: ... got TypeError("data type 'O' not understood")
```
**New Behavior (Error Message):**
```
torch.dynamo.exc.Unsupported: ndarray.astype(object)
Explanation: ndarray.astype('O') or ndarray.astype(object) is not supported by torch.compile, as there is no equivalent to object type in torch.
```
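A hypothetical repro sketch of the new behavior (with `fullgraph=True` the unsupported call surfaces as an error rather than a silent graph break):
```python
import torch
from torch._dynamo.exc import Unsupported

@torch.compile(fullgraph=True)
def f(x):
    return x.numpy().astype("O")  # expected to raise Unsupported after this PR

try:
    f(torch.ones(3))
except Unsupported as e:
    print(e)
```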
### Testing
A new test has been added to `test_numpy_interop.py` which decorates a function containing `ndarray.astype("O")` with `torch.compile`. The test asserts that a `torch._dynamo.exc.Unsupported` exception is raised, confirming the new error handling works as expected.
The test can be run with:
`pytest test/test_numpy_interop.py -k test_ndarray_astype_object_graph_break`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157810
Approved by: https://github.com/jansel
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
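For reference, a sketch of overriding the knob per run (assuming it is exposed as the `cpp.min_chunk_size` entry in inductor config):
```python
import torch._inductor.config as inductor_config

# 512 is the new default described above; raise it back to 4096 to recover the
# previous parallelization behavior.
inductor_config.cpp.min_chunk_size = 512
```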
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Summary: This diff makes changes to the USDT added by RihamSelim in D44636587. The "operator_start" USDT passes in the memory addresses of operator arguments and the argument types. This is so we can record argument values and types in the Strobelight GPUEvent Profiler. The previous diff records the ATEN operator, and this diff lays the groundwork to record ATEN op arguments.
Test Plan: I ensured this code builds by running the example in this diff, and testing profiler changes in this diff.
Reviewed By: RihamSelim
Differential Revision: D75606556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155185
Approved by: https://github.com/malfet
From the previous PR: https://github.com/pytorch/pytorch/pull/157608, I added `format_consts_to_cpp` to build the consts bytes.
But it still raises a clang ASAN stack-allocation error when building large consts.
This PR:
1. adds `test_aot_inductor_consts_cpp_build` to the stack allocation skip list.
2. adds ATTRIBUTE_NO_SANITIZE_ADDRESS to skip the ASAN check, because the consts array is located in the global area.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
Now instead of erroring out on `empty_cache` call during graph capture or under mempool context, we will just silently do nothing. This used to be the behavior for mempools, cudagraphs used to error out, but it's fine to just ignore the call.
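A minimal sketch of the now-ignored call (assumes a CUDA device; previously this raised during graph capture):
```python
import torch

x = torch.zeros(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x + 1
    torch.cuda.empty_cache()  # silently a no-op while capturing
```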
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
Written with Claude Code.
Fixes https://github.com/pytorch/pytorch/issues/157569
Fixes https://github.com/pytorch/pytorch/issues/158134
NumPy and PyTorch handle advanced indexing differently when advanced indices are separated by slices (e.g., arr[:, [0], :, 0]). PyTorch uses "outer" indexing, placing result dimensions in their original positions, while NumPy uses "vectorized" indexing, moving advanced-index dimensions to the front.
This adds _numpy_style_advanced_indexing() to detect separated advanced indices and transpose results to match NumPy's dimension ordering, ensuring torch._numpy maintains compatibility with NumPy's indexing behavior (a pure-NumPy illustration follows the examples below).
Fixes cases like:
- arr[:, [0], :, 0] now returns shape (1, 5, 7) instead of (5, 1, 7)
- arr[:, [0, 1], :, 0] now returns shape (2, 5, 7) instead of (5, 2, 7)
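A pure-NumPy illustration of the target semantics (shapes chosen to match the examples above):
```python
import numpy as np

arr = np.arange(5 * 2 * 7 * 3).reshape(5, 2, 7, 3)
# The advanced indices ([0] and 0) are separated by a slice, so NumPy moves the
# broadcast advanced-index dimension to the front: result shape is (1, 5, 7).
print(arr[:, [0], :, 0].shape)
```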
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157676
Approved by: https://github.com/manuelcandales
Co-authored-by: Claude <noreply@anthropic.com>
Fixes#157973
`THPUtils_unpackNumberAsBool` now recognises `numpy.bool_` scalars explicitly (using `torch::utils::is_numpy_bool`).
If the object is a NumPy boolean, we retrieve its truth value via `PyObject_IsTrue` and return it, avoiding the previous failing path that attempted to treat it as an integer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158036
Approved by: https://github.com/jansel
Summary: Previously we were saving sharded tensors to the same directory as full tensors. But this doesn't make sense, because on load() you would be loading from a directory which contains both, with no way to distinguish them, so they should be in separate folders.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D78108144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
# Context
In D75803582, we migrated relu/relu_ from out-of-tree to pytorch in-tree. With that, we also changed it to use the ATen op-layer logic:
https://www.internalfb.com/code/fbsource/[04ec3fcd0b09b601ae26a785e595ab960a6ba684]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520
To summarize:
**The behavior before D75803582:**
The Relu operator calls this code (https://fburl.com/code/pezspv40) and launches the Relu kernel.
**The behavior after D75803582:**
The Relu operator uses the ATen logic, which delegates to the clamp_min operator, and no longer launches the Relu kernel.
-----------------
But according to my discussion with @vvk, we should keep using the Relu kernel, instead of adopting ATen logic that delegates to clamp_min, because MTIA's Relu kernel has special optimization for MTIA device.
# This diff
Change relu / relu_ to launch relu kernel, which is same as the original behavior before D75803582.
Note: this doesn't mean to revert D75803582, because we still want to move relu/relu_ to in-tree.
Differential Revision: [D78109262](https://our.internmc.facebook.com/intern/diff/D78109262/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158101
Approved by: https://github.com/albanD
Most added ops are backwards ops, which have not been well-tested previously (thus why they were missed). Necessary ops were identified by manual examination of torch/_meta_registrations.py return values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158073
Approved by: https://github.com/desertfire
The idea of this PR is that sometimes we filter ops based on criteria that are not node-specific. For example, we always filter out simt ops. So I want to group these checks together into a global filtering function.
This can help shrink the config space as well. 20s -> 6s for instantiation 3332.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157866
Approved by: https://github.com/ColinPeppler
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.
* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
Move the `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract an `iter_tensor_offset` method, which returns an offset from the start of the storage associated with a given tensor inside the iterator.
Migrated `index` and `index_put[_accumulate][_serial]` to the new paradigm that does not require an additional tensor for indices nor special handling for 32- vs 64-bit offsets, which resulted in almost a 2x perf gain for a 2000x2000 tensor; see the results below. Before:
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating, I also fixed index_put_accumulate for boolean types by using a compare_and_exchange trick over uint.
Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
This PR fixes #157354
It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.
The problem was caused by this line:
```
set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})
```
This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:
```
set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})
```
Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
Summary:
NumPy based tensor rebuilding from serialization has been deprecated by other backends (eg. [XLA](https://github.com/pytorch/pytorch/pull/137444)). The new flow has CPU storage being constructed with data from the file and then moved to the target backend device.
Furthermore, relying on numpy for serialization will fail loudly when torch.load flips weights_only.
Reviewed By: andyanwang
Differential Revision: D77843238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157884
Approved by: https://github.com/albanD
Summary: Following implementation of the updated ATen Backend for mtia, and diffs enabling in tree view ops (D75266206, D75385411), we can remove custom logic from reducer to handle MTIA view operations.
Test Plan:
CI
Rollback Plan:
Reviewed By: egienvalue
Differential Revision: D77843212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157882
Approved by: https://github.com/albanD, https://github.com/andyanwang
**Summary**
To enable the use case where the input DTensor to the `split` op has `Partial()` placement,
this PR treats `Partial()` in the same way as `Replicate()`. That means the `split` op
only unshards `Shard(dim=x)` if `x == split_dim` and keeps other placements
untouched.
**Test**
Added a new test because `test_dtensor_ops` doesn't test `Partial()` placement.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157991
Approved by: https://github.com/zpcore
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.
I tested this change by rebuilding PyTorch locally with CUDA 12.9 and ran `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is how cuBLAS calls this scaling mode).
I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU. I used the same approach as in #134781, and obtained these speed-ups:


We see that the two kernels perform very closely (I'm surprised, I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better.
I guess the questions are whether we consider this a net-zero change (given that there's improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
**Summary**
Enable fp8 qconv on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qconv op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qconv op is not changed either.
So, the FP8 qconv shares the same op as INT8 qconv and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
Note:
oneDNN does not support quantized fp8 convolution until v3.9, but the version used in PyTorch is v3.7.2. So, the op goes to the reference kernel for now. We have also updated the oneDNN path so that it's compatible with the fp8 dtype. Once oneDNN is upgraded to v3.9 or newer, minimal changes are needed to enable the oneDNN path. We have ensured that the behavior of the reference kernel is the same as the new oneDNN implementation.
- oneDNN version < 3.9 (now)
  - Always go to the reference kernel
- oneDNN version >= 3.9 (future)
  - Go to reference kernel on old platforms (without AMX)
  - Use oneDNN on new platforms (with AMX)
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.
Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block size loads at least 16 bytes of data, e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.
Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.
This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
Summary: In Pytorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region will now have its frame id and frame compile id associated with it. Update the image in the doc and add a description of this in the doc itself
Test Plan:
{F1980179183}
Rollback Plan:
Differential Revision: D78118228
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066
Approved by: https://github.com/aaronenyeshi
Users may want to send compile-related but customized logging info to dynamo_compile. One example is logging the current training iteration index when recompilation happens. In general, the current training iteration index is not available to the compiler, since the same compiled function may be called multiple times in the same training iteration. The user can provide the training iteration index in a user hook, and torch.compile logs it when recompilation happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157961
Approved by: https://github.com/masnesral
Fixes https://github.com/pytorch/pytorch/issues/154312
Fix logdet returning finite values for singular matrices on CUDA (https://github.com/pytorch/pytorch/issues/154312)
PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.
**Problem:**
The matrix from issue https://github.com/pytorch/pytorch/issues/154312 returns finite values instead of -inf despite being singular.
**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal:
1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout
**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:
- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)
**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks
**Results:**
- ✅ All singular matrices now correctly return -inf on both CPU and CUDA
- ✅ Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
- ✅ Results match NumPy's slogdet behavior exactly
- ✅ Zero GPU synchronization points for improved performance
- ✅ Comprehensive edge case testing added
**Verification:**
Before: torch.linalg.slogdet(singular_matrix) → finite values (incorrect)
After: torch.linalg.slogdet(singular_matrix) → (sign=0, logabsdet=-inf) ✅
The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.
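A minimal check of the fixed behavior (assumes a CUDA device; a rank-deficient matrix should now report sign = 0 and logabsdet = -inf, matching NumPy):
```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [2., 4., 6.],   # linearly dependent on the first row
                  [0., 1., 1.]], device="cuda")
sign, logabsdet = torch.linalg.slogdet(a)
print(sign.item(), logabsdet.item())  # expected: 0.0 -inf
```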
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD
Co-authored-by: Claude <noreply@anthropic.com>
Differential Revision: D78089705
Previously, to support overriding autotune configs for post-fusion kernels in Inductor with a lookup table, we keyed only on the source code. However, the same source code can have multiple optimal configs, depending on the input sizes. Because of this, we had many collisions in our lookup table, leading to subpar configs. A way around this is to add the size_hints to the lookup key as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158026
Approved by: https://github.com/jansel
Summary:
Replaced ASSERT_FLOAT_EQ, which defaults to a fixed kMaxUlps (= 4 ULPs, see gtest-internal.h), with ASSERT_NEAR, which lets us set epsilon to 1e-3 (approximately 3 ULPs). This allows for a slightly stricter and tunable comparison.
Test Plan:
**Before Fix**
```
✗ Fail: qnnpack:pytorch_qnnpack_testApple - FULLY_CONNECTED_SPARSE_OP_8x1/unit_batch_dynamic_prepacked (0.0s)
Expected equality of these values:
  output_dynamic[i * outputChannels() + c]
    Which is: 9.9160004
  accumulators_float[i * outputChannels() + c]
    Which is: 9.9159956
at 0, 17: reference = 9.9159955978393555, optimized = 9.9160003662109375
------------------------------
```
**After Fix**
Everything passes
Rollback Plan:
Differential Revision: D77911682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157861
Approved by: https://github.com/kimishpatel, https://github.com/lucylq, https://github.com/malfet
## Overview
This PR adds a kwarg to the `table()` method of the profiler allowing users to specify a time unit to be used for all results in the profiling table. The available options are: `s`, `ms` and `us`. If an invalid unit or no unit is provided, then a time unit is selected based on the size of the value (current default behaviour).
## Testing
A unit test has been added to verify this works correctly.
## Documentation
I couldn't find any documentation specific to the `table()` function beyond doc strings which have been updated.
## Example Output
```
import torch
from torch.profiler import profile
with profile() as prof:
    res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))
print(prof.key_averages().table(time_unit="s"))
print(prof.key_averages().table(time_unit="ms"))
print(prof.key_averages().table(time_unit="us"))
print(prof.key_averages().table())
```
```
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 0.000s 10.36% 0.014s 0.007s 2
aten::empty 0.04% 0.000s 0.04% 0.000s 0.000s 2
aten::uniform_ 10.27% 0.014s 10.27% 0.014s 0.007s 2
aten::mm 89.64% 0.119s 89.64% 0.119s 0.119s 1
aten::resolve_conj 0.00% 0.000s 0.00% 0.000s 0.000s 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 0.133s
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 0.055ms 10.36% 13.735ms 6.868ms 2
aten::empty 0.04% 0.054ms 0.04% 0.054ms 0.027ms 2
aten::uniform_ 10.27% 13.626ms 10.27% 13.626ms 6.813ms 2
aten::mm 89.64% 118.892ms 89.64% 118.896ms 118.896ms 1
aten::resolve_conj 0.00% 0.004ms 0.00% 0.004ms 0.001ms 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132.631ms
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 55.495us 10.36% 13735.202us 6867.601us 2
aten::empty 0.04% 54.121us 0.04% 54.121us 27.061us 2
aten::uniform_ 10.27% 13625.586us 10.27% 13625.586us 6812.793us 2
aten::mm 89.64% 118892.284us 89.64% 118895.981us 118895.981us 1
aten::resolve_conj 0.00% 3.697us 0.00% 3.697us 1.232us 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132631.183us
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 55.495us 10.36% 13.735ms 6.868ms 2
aten::empty 0.04% 54.121us 0.04% 54.121us 27.061us 2
aten::uniform_ 10.27% 13.626ms 10.27% 13.626ms 6.813ms 2
aten::mm 89.64% 118.892ms 89.64% 118.896ms 118.896ms 1
aten::resolve_conj 0.00% 3.697us 0.00% 3.697us 1.232us 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132.631ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157913
Approved by: https://github.com/sraikund16
This adds new context manager based PG management to dist2. This allows for managing the active process group much in the same way as a stream
```py
with dist2.process_group(pg):
dist2.current_process_group().allreduce(...).wait()
```
matches
```py
with torch.cuda.stream(stream):
torch.cuda.current_stream().synchronize()
```
Test plan:
```
pytest test/distributed/test_dist2.py -k context
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157988
Approved by: https://github.com/fduwjj
DCE was incorrectly eliminating unused random operations like torch.rand() that have global RNG side effects, causing inconsistent results between eager and compiled execution modes.
**Root cause**: Python random functions (torch.rand, torch.randn, etc.) don't have the _nondeterministic_seeded attribute, so node.is_impure() returns False, allowing DCE to eliminate them despite advancing global RNG state.
**Solution**: Enhanced is_impure() in torch/fx/node.py to recognize Python random functions and mark them as impure when they use global RNG, regardless of the impure_random parameter setting. This ensures consistency between eager and compiled execution even when config.fallback_random=False.
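A small eager-only illustration of why such nodes are impure: an "unused" `torch.rand` still advances the global RNG state, so eliminating it changes downstream values.
```python
import torch

torch.manual_seed(0)
a = torch.rand(4)          # first draw
b = torch.rand(4)          # second draw

torch.manual_seed(0)
_ = torch.rand(4)          # result unused, but the global RNG still advances
c = torch.rand(4)
assert torch.equal(c, b)   # c matches the *second* draw, not the first
```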
**Key features**:
- Handles comprehensive list of random functions: rand, randn, randint, randperm, rand_like, randn_like, randint_like, normal, poisson, bernoulli, multinomial
- Generator optimization: Only marks as impure when using global RNG (no generator or generator=None). Operations with explicit generators don't affect global state and can be optimized.
- Works with both impure_random=True and impure_random=False cases
- Cleaner architecture: addresses root cause rather than working around it
**Tests**: Enhanced test_impure_random to verify both FX tracing and AOT compilation codepaths, ensuring random operations are preserved and eager/compiled execution consistency is maintained.
🤖 Generated with [Claude Code](https://claude.ai/code)
Fixes https://github.com/pytorch/pytorch/issues/151524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157981
Approved by: https://github.com/mlazos
Co-authored-by: Claude <noreply@anthropic.com>
Summary:
We are testing re-enabling autovectorization in some codepaths.
These previously resulted in crashes when compiling with clang17; we now rely on clang19.
Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface
We are going to deploy it on ads workloads
Rollback Plan:
Differential Revision: D77448445
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157984
Approved by: https://github.com/Skylion007
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)
- Add codegen for static linkage
- refactor test code for test_compile_after_package tests
For now, the following option must be used together with `"aot_inductor.compile_standalone": True` (a sketch of the combination is shown below):
- `"aot_inductor.package_cpp_only": True`
A follow-up PR will change `"aot_inductor.package_cpp_only"` to be set to True automatically.
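A hypothetical sketch of the option combination described above (how this dict is plumbed into the AOTI compile/package call depends on the workflow):
```python
# Both flags are named in this PR; treated here as inductor options entries.
aoti_options = {
    "aot_inductor.compile_standalone": True,
    "aot_inductor.package_cpp_only": True,  # must currently accompany compile_standalone
}
```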
```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
We assumed that the output in an FX graph would always just be a
list[Tensor], even in the single tensor return case.
It is possible for the output to be a single Tensor. This can happen
by calling torch.fx.split_module on the module.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157803
Approved by: https://github.com/oulgen
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism was added when a specific backend is unavailable, enhancing user configurability.
### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)
### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior for `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`,)
### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
Summary:
`None` and `Ellipsis` in multi-dimensional indexing was previously not covered.
Moreover, we introduce a small optimization for `slice(None)` and a passthrough when symints do not appear in the indexing.
The remaining case is where indexing is by tensor, which is fairly complicated; we passthrough in that case.
Test Plan:
added tests
Rollback Plan:
Differential Revision: D77943929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157821
Approved by: https://github.com/pianpwk
Summary:
This DIFF is to fix the following issue:
In the Python source code for CompiledFxGraph, the FX graph segment for the Triton kernel is broken. For example, consider the following function:
```
def fn(a, b, c):
    x = torch.nn.functional.linear(a, b)
    x = x.sin()
    x = x.t() + c
    return x
```
Inductor compiled this FX graph into two nodes: the first one is mm, the second one is a triton kernel for sin + transpose + add. The FX graph segment for the triton kernel is like the following:
```
Graph fragment:
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %arg2_1), kwargs = {})
```
Basically, only the "add" node appears in the FX graph segment.
The root cause is that the function caffe2/torch/_inductor/utils.py:gather_origins does not detect realized nodes correctly.
To fix this issue, the IRNode is checked against the following types:
- ir.ComputedBuffer
- ir.InputsKernel
- ir.InputBuffer
- ir.ReinterpretView
- ir.TemplateBuffer
If it is one of them, it is realized; otherwise, it is not.
Test Plan:
buck2 run mode/opt caffe2/test/inductor:provenance_tracing -- caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact.test_triton_kernel_to_post_grad_tracing_cuda
Rollback Plan:
Differential Revision: D77748371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157578
Approved by: https://github.com/mlazos
@animesh pointed out that using the whitelist for strides can result in confusing graphs like the following:
```
s60: "Sym(s60)", L_hidden_states_: "bf16[1, 4096, 3072][s60, 3072, 1]cuda:0"
```
We probably want to capture the relationship between sizes and strides anyway, so let's make the whitelist only mark the sizes dynamic. That same graph now looks like this:
```
L_hidden_states_: "bf16[1, 4096, 64][262144, 64, 1]cuda:0"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157960
Approved by: https://github.com/pianpwk
Summary:
- adding mmap-ing
- more efficient writing in larger chunks
latency from ~150s to ~6s for simple row-wise consolidation of a 7gb model sharded across 4 ranks
Test Plan:
ran consolidation with the following code:
```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time
start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```
With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors
Rollback Plan:
Differential Revision: D77960054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
If the final output file is in remote storage, then create a local temp directory to write the files and upload the files to the remotes storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.
Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
Fixes#157195
### Summary:
Fixed Issue 157195 by adding a new error message for torch.binomial in **aten/src/ATen/native/Distributions.cpp**
### Explanation
According to the issue,
```
import torch
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
```
`RuntimeError: Found dtype Float but expected Long`
It looks like we are getting a Tensor error rather than a binomial function error. Since the error is coming from **pytorch/aten/src/ATen/TensorIterator.cpp**, it seems like it is trying to align the tensor data to the same datatype for smooth tensor computations instead of giving a binomial function error.
I tried using both arguments as longs and both as ints and got the right binomial function error
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
```
torch.binomial(torch.tensor([10.0]).int(), torch.tensor([0.5]).int())
NotImplementedError: "binomial_cpu" not implemented for 'Int'
```
But when I have both as different datatypes, the TensorIterator.cpp error comes back trying to align the datatypes.
`RuntimeError: Found dtype Float but expected Long`
I then tried to find where the NotImplementedError was documented and found it in **pytorch/aten/src/ATen/Dispatch.h** at lines 193-211:
```
#define AT_DISPATCH_SWITCH(TYPE, NAME, ...) \
[&] { \
const auto& the_type = TYPE; \
constexpr const char* at_dispatch_name = NAME; \
/* don't use TYPE again in case it is an expensive or side-effect op */ \
at::ScalarType _st = ::detail::scalar_type(the_type); \
RECORD_KERNEL_FUNCTION_DTYPE(at_dispatch_name, _st); \
switch (_st) { \
__VA_ARGS__ \
default: \
TORCH_CHECK_NOT_IMPLEMENTED( \
false, \
'"', \
at_dispatch_name, \
"\" not implemented for '", \
toString(_st), \
"'"); \
} \
}()
```
In the **AT_DISPATCH_SWITCH** function, it picks a tensor and its datatype and checks if the Tensor datatype matches the supported datatypes. If not we get the Not Implemented error. Unfortunately, I think the **AT_DISPATCH_SWITCH** function, uses the `common_dtype` from TensorIterator in order to run. So TensorIterator.cpp needs to happen before the AT_DISPATCH_SWITCH function.
### Summary: We are getting the wrong error message because **TensorIterator.cpp** gets called and errors out due to Tensor datatype mismatch before we can get the right error message in **Dispatch.h** for torch.binomial not supporting that datatype.
### Options for the Fix
**Option 1**: Make the error message in TensorIterator.cpp more general so it applies to torch.binomial. An error message along the lines
`RunTime Error : "Tensor Datatypes", op.target_dtype," and ", common_dtype_, "are different "`
**Option 2**: Add an error message for the binomial function datatype mismatch before the the TensorIterator.cpp error message gets called.
Although Option 1 seemed easier, I think Option 2 is better, as it is more specific to the binomial function, while Option 1 would affect all tensors with a datatype mismatch.
**This PR applies the fix for Option 2**
After Fix :
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
RuntimeError: Binomial function arguments count and prob must have same datatype of type Float, got: count = Long, prob = Float
```
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157658
Approved by: https://github.com/soulitzer
Prior to this PR, torch.onnx.export(..., dynamo=True, verify=True, report=True) did not support symbolic arguments. An example:
```python
class M(torch.nn.Module):
    def forward(self, a, x):
        return a + torch.tensor(1) + x


op = torch.onnx.export(
    M(),
    (1, torch.ones(2)),
    dynamic_shapes=(torch.export.Dim.DYNAMIC, {0: torch.export.Dim.DYNAMIC}),
    dynamo=True,
    report=True,
)
```
Symbolic arguments are like constant arguments in that they don't have tensor_meta either. Besides, torch.export.export supports model inputs having constants, which is different from the legacy issue https://github.com/pytorch/pytorch/issues/99534, where we tried to get the FX graph directly from dynamo export. Thus, `_remove_non_tensor` is deleted from args processing.
NOTE: If ConstantArgument shows up in the exported_program, it was kept to align the length of the inputs to the nn.Module, but it's irrelevant to the model graph, which is why the input is omitted in the ONNX model.
The test `test_constant_argument_user_input_is_omitted_in_onnx_graph` needs #157719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157734
Approved by: https://github.com/justinchuby
Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in
`STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This caused
regional `torch.compile` for optimizing ComfyUI GGUF + LoRA to either
(1) exceed the recompilation limit of 8, which results in suboptimal
performance, or (2), even if the recompilation limit is increased, spend
unnecessarily long compiling (180s vs. 20s for Flux).
This patch fixes the recompilation issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891
Approved by: https://github.com/williamwen42, https://github.com/mlazos
Summary:
Change AOTI_RUNTIME_DEVICE_CHECK to the following depending on device:
AOTI_RUNTIME_CUDA_CHECK
AOTI_RUNTIME_XPU_CHECK
AOTI_RUNTIME_CPU_CHECK
Currently in the codebase, only `AOTI_RUNTIME_CUDA_CHECK` is used.
This shouldn't change anything as of now, but we do this to prepare for simultaneously loading multiple backends (e.g. CPU and CUDA) in AOTI standalone.
We don't want people writing `AOTI_RUNTIME_DEVICE_CHECK` for both CPU and CUDA checks. This could cause compilation problems when we statically link both CPU and CUDA models.
Test Plan:
CI
Rollback Plan:
Reviewed By: muchulee8
Differential Revision: D77742977
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157818
Approved by: https://github.com/jingsh
## Issue being addressed
`is_sparse` presents itself as determining whether a tensor is sparse. HOWEVER, it only checks whether the tensor layout is `sparse_coo`. This has led to confusion among developers: when non-COO sparse tensors are provided, it returns false, despite those tensors being sparse.
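For illustration, the behavior described above (a CSR tensor is sparse, yet `is_sparse` reports `False`):
```python
import torch

coo = torch.eye(3).to_sparse()        # sparse_coo layout
csr = torch.eye(3).to_sparse_csr()    # sparse_csr layout

print(coo.is_sparse)  # True
print(csr.is_sparse)  # False, even though the tensor is sparse
```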
## Considered Remedy
Fixing this is doable, but it would add complexity, as existing systems may depend on this behavior remaining consistent. Even inside PyTorch, is_sparse is used by `bform`, which states that it supports only `sparse_csr and sparse_coo`, meaning additional work/thought would have to go into handling `sparse_csc` and `sparse_bsr`.
## Remedy provided in this PR
Given these complications, the lowest-risk, highest-gain action was to add clear warning messaging to the function for now, to avoid confusion for developers using it. The rest of the function's behavior remains identical.
## Issue content
Addresses issue number: #101385
Original issue: https://github.com/pytorch/pytorch/issues/101385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157378
Approved by: https://github.com/soulitzer
Summary: We added the ability to make annotations global or local based on an input flag in PyTorch, but didn't add the args to the linter.
Reviewed By: mzzchy
Differential Revision: D77959409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157858
Approved by: https://github.com/mzzchy
Add the env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. LTO is disabled by default because it may increase build time.
Rollback Plan:
Differential Revision: D77899195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.
It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular, it assumes a DeviceMesh must have `mesh_dim_names`, which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` entirely, as the other check it contains, `_deprecate_warnings`, added two years ago, is no longer used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
I was confused about why the distributed tests weren't showing up quickly on HUD. It's because the call to run_tests.py for distributed didn't include the upload-artifacts-while-running flag, so I set it to default to IS_CI so I don't need to put the flag everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157868
Approved by: https://github.com/huydhn
Per title, as it fails with the following error if "+PTX" was used in `TORCH_CUDA_ARCH_LIST`:
```
File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in skip
has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in <genexpr>
has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'pute_120'
```
Because slicing `arch[3:]` does not yield only digits for the `compute_120` element of `torch.cuda.get_arch_list()`:
```python
>>> torch.cuda.get_arch_list()
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
```
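One way to make that check robust to non-`sm_` entries is to parse the numeric suffix instead of taking a fixed slice; a sketch of the idea (not necessarily the exact fix in this PR):
```python
arch_list = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120", "compute_120"]

# take the digits after the last underscore instead of assuming a 3-character prefix
has_tf32 = all(int(arch.rsplit("_", 1)[-1]) >= 80 for arch in arch_list)
print(has_tf32)  # False for this list, since sm_75 < 80
```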
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157711
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
**Background:**
```Shell
[1376/2332] Building CUDA object caffe2/CMakeFiles/torch_...h/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu.o
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
size_t numelIn_ = -1;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
size_t numelOut_ = -1;
^
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
size_t numelIn_ = -1;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
size_t numelOut_ = -1;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157889
Approved by: https://github.com/mlazos
The CPU inductor processes `.to(torch.uint8)` incorrectly, leading to numerical inconsistencies. The convert_float_to_int8 function may return incorrect results for negative inputs, such as -2.xx, when the data type is uint8_t, producing 0 instead of 255. This issue stems from the clamping logic; we should avoid converting min_val to uint8_t too early.
Fixes https://github.com/pytorch/pytorch/issues/156788
@leslie-fang-intel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157267
Approved by: https://github.com/leslie-fang-intel
Summary: We found there's a special case in a recent APS model where the input tensor has a smaller size compared to the split size. It is automatically truncated in split.Tensor, thus we add an extra condition check for split_with_sizes when doing the normalization.
Test Plan:
### unit
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_aten_normalization
```
Buck UI: https://www.internalfb.com/buck2/2ecd1ef8-8efe-4245-b4c8-282c23645b3c
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7599824648585787
Network: Up: 3.9GiB Down: 9.2GiB (reSessionID-1396c91e-0dd2-457b-a49b-a6ab1f2a7d8f)
Loading targets. Remaining 0/5344 99617 dirs read, 1074949 targets declared
Analyzing targets. Remaining 0/123279 4988547 actions, 5966764 artifacts declared
Executing actions. Remaining 0/728058 209:52:59.9s exec time total
Command: test. Finished 12466 local, 209448 remote, 1226 cache (1% hit) 42:10.5s exec time cached (0%)
Time elapsed: 26:07.6s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
before fix:
aps-afoc_apop_pt2_v0-db2fe0449a
after fix:
aps-afoc_apop_pt2_v0-755ad0cdc6
Rollback Plan:
Differential Revision: D77961394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157857
Approved by: https://github.com/anijain2305
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config.
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig
```
Rollback Plan:
Differential Revision: D77889754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
Fixes #157673
For the call trace:
```
......
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\common.py", line 2569, in reduction
return self.kernel.reduction(dtype, src_dtype, reduction_type, value)
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 2155, in reduction
self._gen_parallel_reduction_buffers(acc, acc_type, reduction_type, init_dtype)
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 1942, in _gen_parallel_reduction_buffers
reduction_prefix_array(
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 335, in reduction_prefix_array
if cpp_builder.is_msvc_cl()
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 317, in is_msvc_cl
return _is_msvc_cl(get_cpp_compiler())
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 240, in _is_msvc_cl
subprocess.check_output([cpp_compiler, "/help"], stderr=subprocess.STDOUT)
torch._inductor.exc.InductorError: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
```
On an MSVC environment with a non-English language pack, the compiler path raised a `utf-8` issue. I added `normalize_path_separator` to normalize the compiler path and avoid the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157835
Approved by: https://github.com/jansel
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
Previously, if users wanted to let PyTorch determine the CUDA arch when JIT-loading CUDA extensions, they had to leave the environment variable `TORCH_CUDA_ARCH_LIST` empty, which raises a warning. This commit adds the option to set `TORCH_CUDA_ARCH_LIST=native`, to tell PyTorch that the user intentionally wants the native CUDA arch.
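A minimal usage sketch (the extension name and source file below are hypothetical):
```python
import os

os.environ["TORCH_CUDA_ARCH_LIST"] = "native"  # opt in to the locally detected arch, no warning

from torch.utils.cpp_extension import load

ext = load(name="my_ext", sources=["my_ext.cu"], verbose=True)  # hypothetical extension
```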
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923
Approved by: https://github.com/ezyang
This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo.
The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.
There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780
This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
# Feature
If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes.
This PR adds support for this feature to Inductor's FX IR backend.
We generate FX IR for precomputed size args in 3 steps:
1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.)
2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`.
3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions.
In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes.
# Test plan
Added a new CI test exercising this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758
Approved by: https://github.com/jansel
Via a Google search I got to `torch.autograd.profiler` and implemented my code with it, only to be taken by surprise on finding `torch.profiler`, which has a note saying the autograd one is legacy.
This just adds such note to `autograd.profiler` to avoid this confusion and waste of time to future people in my situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157459
Approved by: https://github.com/sraikund16
Summary: These changes in D76442012 got reverted after the PR landed because aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failed with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`. That test failure is definitely transient and unrelated to my changes, so I'm re-creating the diff.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D77871099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18
We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920
Approved by: https://github.com/soulitzer
Summary:
I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures.
At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance.
Test Plan:
I'm fairly confident from local testing that this should be a no-op.
But also, Pytorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157757
Approved by: https://github.com/ezyang, https://github.com/Skylion007
This is useful for vLLM, which runs AOTAutograd directly on graphs after
they have been split.
I created a new flag for this instead of reusing
`keep_original_node_name` (please let me know if you think I should reuse this).
The reasoning is:
- The names of the placeholder nodes are different from the targets of
the placeholder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it
looks public, and it has extensive public use. For example, this change
would actually be BC-breaking to vLLM (they rely on the subgraph input
names being different at the moment).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157733
Approved by: https://github.com/ezyang
Update s390x test marks
test_logs_out from test/dynamo/test_logging.py is updated
and no longer fails on s390x.
test_qengine from test/test_torch.py doesn't work on s390x:
no QEngine is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157541
Approved by: https://github.com/huydhn
In this PR, we are enabling `HPU` device-specific function calls for random operations. These calls will manage the setting and unsetting of the `context of Random Number Generator`.
While HPU devices typically utilize a `Mersenne-based RNG`, DTensor-specific random operations employ an `offset-based (Philox) RNG tracker`, which is currently integrated specifically with `CUDA`.
To integrate a similar offset-based RNG tracker within the `HPU backend`, a backend-specific device handle function is necessary to identify the execution context of these random operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156581
Approved by: https://github.com/jeromean, https://github.com/wanchaol
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
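For illustration, a config string following the convention above (the key names here are made up, not real allocator options):
```python
import os

# hypothetical keys, shown only to illustrate the `key:value` / `key:[v1,v2]` format
os.environ["PYTORCH_ALLOC_CONF"] = "key1:123,key2:[val1,val2]"
```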
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Target determination sorts the tests in a PR CI run based on heuristics about which tests are more relevant to the PR's changes. This can help provide faster CI signal as well as help alleviate capacity concerns as job durations should decrease due to catching failures earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156545
Approved by: https://github.com/jeffdaily, https://github.com/clee2000
# Motivation
https://github.com/pytorch/pytorch/pull/155451 decoupled `torch._C._storage_Use_Count` from CUDA and introduced a corresponding unit test:
815545f2dd/test/test_torch.py (L257-L262)
However, this test fails when PyTorch is built with debug assertions enabled. @clee2000 disabled this UT in https://github.com/pytorch/pytorch/pull/156731. The root cause is that `_cdata` is obtained from an `intrusive_ptr`, not a `weak_intrusive_ptr`. As a result, calling `c10::weak_intrusive_ptr::use_count` on it triggers the internal assertion:
815545f2dd/c10/util/intrusive_ptr.h (L912-L917)
For example:
```python
a = torch.randn(10, device=device) # refcount=1, weakcount=1
prev_cf = torch._C._storage_Use_Count(a.untyped_storage()._cdata) # violates the assertion
```
This violates the expected invariant inside `weak_intrusive_ptr::use_count`, which assumes the pointer was originally constructed from a valid `weak_intrusive_ptr`. Actually, `storage_impl` is obtained from an `intrusive_ptr`.
815545f2dd/torch/csrc/Module.cpp (L2105-L2109)
# Solution
Use `c10::intrusive_ptr::use_count` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157694
Approved by: https://github.com/albanD
Getting a build issue because of the enablement of the fp8 data type for oneDNN in the qlinear and qlinear_prepack files after commit c2185dc4a5626848df37cad214b73d5ae7dd4f17.
Currently cpuinfo is disabled for Power systems, and because of that it gives the error below.
**Error:**
‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
Made the required changes and the build issue is now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157469
Approved by: https://github.com/malfet
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.
When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile
You can also use PrecompileContext.serialize() to manually serialize a full object.
I've added unit tests to exhibit this behavior.
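A rough usage sketch; the flag name comes from this description, while the exact config namespace (`torch._dynamo.config`) is an assumption:
```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.caching_precompile = True  # assumed location of the new flag

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(4))  # the resulting CompilePackage entry is saved to DynamoCache
```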
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
The structure is
```
torch/
__init__.py
version.py
```
When we import torch, only `torch/__init__.py` is executed by default.
Submodules like `version.py` are not automatically imported or attached to the torch module,
so without anything in `__init__.py`, `torch.version` may not be found. In this PR, we make the import explicit.
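This is standard Python behavior, illustrated here with a stand-in standard-library package rather than torch itself:
```python
import importlib

pkg = importlib.import_module("concurrent")    # importing a package...
print(hasattr(pkg, "futures"))                 # ...does not import its submodules: False

importlib.import_module("concurrent.futures")  # an explicit import attaches the submodule
print(hasattr(pkg, "futures"))                 # True
```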
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
Summary: We want to add versioning to the DCP metadata so that whenever planner logic changes, we can use the version recorded on save to determine how to load the data.
Test Plan:
added a test
Rollback Plan:
Differential Revision: D76135887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
Note on backward precision over fp16:
A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))
Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.125 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \exp2(-3) \approx 0.002$.
The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \exp2(-2) \approx 0.002$. Same for $e_b=-1$.
So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 * 0.002$.
That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.
These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.
Thanks for your time and review!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
***SUMMARY***
The main function in this test overrides that of the GTest framework, which contains the `RUN_ALL_TESTS()` call. The main function in this test is only invoked when certain conditions apply, in this case when the C10_MOBILE directive is provided. This is wrong, as we always want to call `RUN_ALL_TESTS()`.
In this PR, we only make the test suite available for cases where it applies, i.e. if the C10_MOBILE directive exists, which represents the caching allocator and is only exposed on mobile.
***TEST PLAN***
This test should run in the modes where it applies, which should be covered in the CI run.
Below is a sample run in dev-nosan mode, which does not have the caching allocator.
BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```
AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining 0/46242 1871690 actions, 2251668 artifacts declared
Executing actions. Remaining 0/257870 83:28:24.4s exec time total
Command: test. Finished 10 remote, 112314 cache (99% hit) 83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```
Rollback Plan:
steps:
- manual.note:
content: Revert this diff
Reviewed By: patskovn
Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
When `torch.backends.mkldnn.matmul.fp32_precision == 'bf16'`, we also enable mkldnn linear in the inductor path and allow it to run with the bf16 computation data type.
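A rough sketch of how the flag is exercised (the module shapes are arbitrary):
```python
import torch

torch.backends.mkldnn.matmul.fp32_precision = "bf16"  # the flag described above

lin = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

compiled = torch.compile(lin)
out = compiled(x)  # inductor may now lower the linear to the mkldnn bf16 path on CPU
```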
Test plan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Fixes #151223
Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.
Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded, to maintain the expected 1‑D shape for writeback operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
When the CC and CXX compilers are set to clang, and clang was compiled with libc++, compilation of torchvision fails with:
```
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
compiler_name, compiler_version = self._check_abi()
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
_, version = get_compiler_abi_compatibility_and_version(compiler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
if tuple(map(int, version)) >= minimum_required_version:
^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```
Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```
After adjusting the version parser, clang is able to compile extensions successfully.
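One tolerant way to handle such identifications is to keep only the leading numeric release components; a sketch of the idea, not necessarily the exact change in this PR:
```python
import re

version_string = "20.1.7+libcxx"  # e.g. from `clang -dumpfullversion`

# keep only the leading MAJOR.MINOR.PATCH digits, ignoring suffixes such as `+libcxx`
match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
version = tuple(int(part) for part in match.groups())
print(version)  # (20, 1, 7)
```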
Fixes #157665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:
- Corrected the word "paramter" to "parameter" for better readability and accuracy.
While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.
Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.
- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.
This is a non-functional change and does not impact behavior or test outcomes.
Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157362
Approved by: https://github.com/ezyang
They might have been slow on CUDA-11.3, but that version of CUDA is long gone. A more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using the following values (at sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)
Luckily, Chebyshev polynomials for absolute argument values greater than 1 quickly reach infinity; see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
This is not the case for Laguerre polynomials, but it's probably fine to just limit the input to 1e7.
Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s
OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s
OK (skipped=72, expected failures=8)
```
Fixes https://github.com/pytorch/pytorch/issues/79528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
Fixes #ISSUE_NUMBER
This PR fixes a small punctuation issue in the PyTorch README.
Specifically:
Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."
Added comma for clarity between "CUDA driver" and "NVIDIA hardware".
These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157623
Approved by: https://github.com/Skylion007
This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.
### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity
This is a non-functional, cleanup-only PR to improve code readability.
Thanks to the PyTorch team for maintaining such a high-quality codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157629
Approved by: https://github.com/Skylion007
This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where paramters has been corrected to parameters for improved clarity and consistency.
Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.
Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157575
Approved by: https://github.com/eqy
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.
Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:
```bash
python -m pip install --verbose --editable .
python -m pip install --verbose --no-build-isolation --editable .
```
In addition, the SDist is also buildable:
```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```
Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI, because all files under `third_party` will be included in the SDist. The SDist file will be huge if the git submodules are initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #157557
This PR corrects a small spelling error in `test/jit/test_alias_analysis.py`.
- "initalized" → "initialized"
This is a minor comment correction and does not affect functionality or logic.
Thank you for maintaining this amazing codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157628
Approved by: https://github.com/Skylion007
Fixes #154978
## Test Result
```python
>>> import torch
>>> import numpy as np
>>> import torch.nn as nn
>>> import torch.distributions.normal as norm
>>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu'))
>>> print('Using {}'.format(device))
Using cuda
>>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh())
>>> m.to(device, dtype=None, non_blocking=False)
Sequential(
(0): Linear(in_features=1, out_features=128, bias=True)
(1): Tanh()
(2): Linear(in_features=128, out_features=128, bias=True)
(3): Tanh()
(4): Linear(in_features=128, out_features=128, bias=True)
(5): Tanh()
)
>>> opt = torch.optim.Adam(m.parameters(), lr=0.001)
>>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad)))
Number of trainable parameters: 33280
>>> input_tensor = torch.tensor(77.0, device=device)
>>> target = torch.tensor(66.0)
>>> loss_function = nn.MSELoss()
>>> print('Loss Function: ', loss_function)
Loss Function: MSELoss()
>>> loss = loss_function(input_tensor, target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward
return F.mse_loss(input, target, reduction=self.reduction)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss
return torch._C._nn.mse_loss(
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089
Approved by: https://github.com/cyyever, https://github.com/albanD
Thanks @huydhn for spotting two name mismatches in the CI configs.
We were matching against "test_h100_symm_mem" instead of "h100-symm-mem".
Also, replaced `TORCH_SYMMMEM` env setting with programmatic method:
`symm_mem.set_backend(...)`
Further, skips a hanging test in `test_nvshmem_triton.py`. (TODO @codingwithsurya)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157597
Approved by: https://github.com/fduwjj, https://github.com/huydhn
Summary: We currently have foreach kernel implementations for MTIA, and when we don't, we internally decompose the ops. Anyone using this list for compatibility checks should be sending through the foreach kernels.
Reviewed By: egienvalue, scottxu0730
Differential Revision: D77751248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
Summary: Add a pass use_triton_fp8_swish_replace_normal_swish to replace _triton_swish_rms_norm with its counterpart that supports fp8 triton_swish_rms_norm, and turn on fp8 during inference.
Test Plan:
```
buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12.4 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR --model-snapshot-id=899072727_0 --node-replacement-dict="{}" --gpu-trace --add-passes=use_triton_fp8_swish_replace_normal_swish
```
The perf improvement on the 100x model with this pass is roughly ~7%, details are recorded [here](https://docs.google.com/document/d/1eIV_OTQyQcf_DlEDxwycTwhyGxT5OJkLzs8cPL6EMYc/edit?tab=t.0)
Rollback Plan:
Reviewed By: frank-wei
Differential Revision: D76531303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157574
Approved by: https://github.com/frank-wei
Summary:
This is an improvement over `_broadcast_rank0_decision`, where we use rank0's decision and broadcast it to every rank. The issue with `_broadcast_rank0_decision` is that we observed large variance in peak memory usage. One cause is that different ranks receive tensors with different dynamic shapes, and the hints of those tensors differ across ranks. If we rely only on rank0's decision and it is unlucky enough to get unrepresentative hints, then the decision it makes may not be suitable for other ranks.
Here, we introduce `sync_cross_rank_decision`, which comes up with the decision after comparing all ranks' local decisions. It will (a code sketch follows the list):
1. all gather decisions from all ranks;
2. test each decision on the current rank and get its estimated memory usage;
3. all reduce estimated memory usage with ReduceOp.MAX, so that we know the maximum memory usage of each decision on all ranks;
4. pick the decision which gives us the minimum maximum memory usage.
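Roughly, in code (a hedged sketch of the selection logic; the function and argument names are illustrative, not the actual internal API):
```python
import torch
import torch.distributed as dist

def sync_cross_rank_decision(local_decision, estimate_peak_memory, group=None):
    world_size = dist.get_world_size(group)

    # 1. gather every rank's local decision
    decisions = [None] * world_size
    dist.all_gather_object(decisions, local_decision, group=group)

    # 2. estimate, on *this* rank, the peak memory each candidate decision would use
    costs = torch.tensor([estimate_peak_memory(d) for d in decisions], dtype=torch.float64)

    # 3. max-reduce so entry i holds decision i's worst-case memory across all ranks
    dist.all_reduce(costs, op=dist.ReduceOp.MAX, group=group)

    # 4. pick the decision with the smallest worst-case memory (identical on every rank)
    return decisions[int(torch.argmin(costs))]
```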
A graph to show more details
https://internalfb.com/excalidraw/EX484509
After applying sync_cross_rank_decision, we observed that the variance is much smaller.
Rollback Plan:
Differential Revision: D76714005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156287
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
Summary:
- The symbolic shapes `statically_known_true` usage is wrong; that API is meant to be used on SymNodes. What is needed is V.graph.sizevars.statically_known_true, or V.graph.sizevars.statically_known_Equals, or ideally V.graph.sizevars.statically_known_multiple_of.
- The construction using == 0 is not symbolic; it used to always return false for symbolic inputs.
Differential Revision: D77619293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157400
Approved by: https://github.com/ColinPeppler
Porting passes to bucket all_gathers
The main logic of the pass is done via
1. Searching for all all_gathers from the buckets
Copying tests from @wconstab's PR to test compatibility with reordering.
The test checks only compatibility because, due to (3), the joint all_gather is already scheduled as early as possible, leaving no room for reordering.
Pass changes:
Using mutation ops to match the performance of FSDP; in the future the ideal scenario is a purely functional graph, where inductor does all memory optimizations on its own without mutable ops.
Inductor changes:
Adding foreach_copy_ lowering
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
This PR corrects minor typos in developer-facing comments:
- Replaces 'recieve' with 'receive' in:
- `FunctionalTensorWrapper.cpp`
- `make_boxed_from_unboxed_functor.h`
These changes improve code readability and maintain comment correctness.
Thank you for reviewing!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157544
Approved by: https://github.com/soulitzer
I am trying to see if I can easily add the linearity support for aten.mul to allow Partial placement to propagate through. But it turns out that I have to completely rework the current linearity propagation.
In short, before this PR, linearity mainly supported aten.add and some trivial ops. It is done by allowing Partial inputs to propagate and, in the meanwhile, redistributing Replicate inputs to Partial to preserve the single-device semantic. For example, suppose we want to execute `aten.add(lhs, rhs)` on 2 ranks:
* `lhs` is partial, value on rank 0: `r0`, lhs value on rank 1: `r1`
* `rhs` is replicate, value: `a`
Then, in order to preserve the single-device semantic (which should produce the value `a + r0 + r1`), we do `rhs/world_size` first, then add `rhs` to `lhs`. This means every operand first needs to be partial; then we can add them together.
But this no longer holds for multiplicative operations like `aten.mul`: assuming the same `aten.mul(lhs, rhs)` and values, we don't need to divide the replicated operand by world_size to preserve the single-device semantic, because `a * (r0 + r1) = a * r0 + a * r1`.
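A plain-number illustration of the two rules above, with world_size = 2:
```python
r0, r1 = 3.0, 5.0  # shards of a partial-sum value P = r0 + r1
a = 2.0            # a replicated value

# additive: each rank must add only a / world_size, otherwise `a` is counted twice
assert (r0 + a / 2) + (r1 + a / 2) == (r0 + r1) + a

# multiplicative: a replicated factor distributes over the partial sum, no rescaling needed
assert a * r0 + a * r1 == a * (r0 + r1)
```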
So to accommodate the difference between add and mul, in this PR I:
* change linearity to be an int to support different linearity types; additive and multiplicative linearity are separate
* add checks to ensure only a subset of partial types can support linearity (namely partial-sum/avg)
* handle the linearity type plumbing through the pointwise ops.
* add `mul.Tensor/Scalar` to be the multiplicative linearity
* added the tests to show that the partial placements can be propagated with `aten.mul`
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157340
Approved by: https://github.com/zpcore
This PR fixes a minor typo in `test/jit/test_modules.py`:
- Before: `intialized`
- After: `initialized`
There are no functional code changes — this is a comment-only fix to improve clarity and consistency.
Thank you to the PyTorch team for maintaining this outstanding project.
Please let me know if anything else is needed.
With appreciation,
Abhishek Nandy
[@abhitorch81](https://github.com/abhitorch81)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157226
Approved by: https://github.com/Skylion007
This is a minor typo fix in `test/test_transformers.py`:
- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.
There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.
Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
Summary:
This is a follow up after the PR to add comm override support: https://github.com/pytorch/pytorch/pull/155189
The previous PR loosely checks the allocation mixin classes, which isn't really safe as the actual hook may still override the behavior.
This may lead to unnecessary confusion for no good use case. So for now we just make the 2 sets of APIs largely incompatible:
1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is ok.
2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is not ok.
Basically, `set_allocate_memory_from_process_group_for_comm` is like a drop-in hammer, while `set_custom_all_gather/reduce_scatter()` are finer-grained scalpels that require more crafted code.
We can revisit this if there's a use case in between, but for now they can be largely viewed as independent from each other (even though we do share some of the underlying pieces for now, that could be subject to change and should not be exposed to end users).
Test Plan: added UT
Differential Revision: D77681620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157487
Approved by: https://github.com/weifengpy
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP - all_gather(tp+dp sharded param on TP dim) -> allgather dp_sharded buffer on DP dim
Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note: this exemption was incorrect, so it should be removed. But it was also put there for a reason, to help move collectives past wait nodes that are not related to that collective. After this fix, reordering performance may be worse and we need to find a smarter way to decide if a particular wait node is a blocker for a given collective.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
The reason for the inner/outer method split is to keep the outer method conforming
to the typedef for a comms graph pass, which returns one object, while
allowing unit tests to call the inner method, which returns more metadata
useful for testing the pass. The logs should be in the inner part, so
they also work during unit testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
They were slow on CUDA-11.3, which is long gone; let's see if they work now.
Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s
OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s
OK (skipped=80)
```
Fixes https://github.com/pytorch/pytorch/issues/79528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
Summary:
When we compute contiguity for a tensor with dynamic shapes, we:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity; I will follow up with changes for other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.
Test Plan:
contbuild & OSS CI,
Rollback Plan:
Reviewed By: malfet
Differential Revision: D77639021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
Add complex number unwrapping in functional collectives used by DTensor.
Complex tensors are not directly supported by underlying comm kernels
(e.g. nccl) but complex tensors can be viewed as real tensors of a
higher rank (added size-2 tensor dim represents real vs im component).
Collective output is then viewed as complex to restore the
original/expected shape and dtype.
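For reference, the view trick in isolation:
```python
import torch

z = torch.randn(4, dtype=torch.complex64)

as_real = torch.view_as_real(z)        # shape (4, 2), float32 -- what the comm kernel sees
back = torch.view_as_complex(as_real)  # restores the original shape and dtype

assert back.shape == z.shape and back.dtype == z.dtype
assert torch.equal(back, z)
```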
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329
Approved by: https://github.com/XilunWu
Change the branch/tag deletion script that runs once per day to delete more tags
Previous: only delete ciflow tags that didn't correspond to an open PR
New: delete ciflow tags attached to commits that are > 7 days old. Also delete `trunk/<sha>` (I think they are for autorevert) tags that are attached to commits that are > 7 days old
It's hard to figure out when the actual tag was pushed or created, so instead it looks at the commit date, which might lead to unexpected behavior if the tag was pushed much later than the commit (e.g. triggering periodic later to bisect). I think it's ok though since you don't really need the tag after the workflow runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157468
Approved by: https://github.com/izaitsevfb
Serialize BUILTIN_MATCH since they are all stored in the __builtin__ dict.
Also fixed an issue where the wrong global scope was passed to CheckFunctionManager while loading guards. Previously we could always reuse the compile-time global scope for evaluating guards because the compile-time and runtime global scopes are always the same.
For precompile, we need to serialize the compile-time global scope for loading only. We need to point the CheckFunctionManager to the new global scope after loading is finished for evaluating guards.
Differential Revision: [D77159313](https://our.internmc.facebook.com/intern/diff/D77159313/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157016
Approved by: https://github.com/jansel, https://github.com/jamesjwu
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. Meaning, this is now dead code that we can remove.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77520979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```
That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.013s
```
Added an assert in the decorator that the arg is a boolean, to prevent such silent skips in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
Description:
This PR fixes a small documentation typo in torch/_C/_distributed_c10d.pyi, correcting:
Intializes → Initializes
This helps improve clarity in internal docstrings for maintainers and contributors.
Let me know if further changes are needed. Thanks for your time and the amazing work on PyTorch!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157455
Approved by: https://github.com/Skylion007, https://github.com/malfet
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
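A hypothetical repro of the failure mode (the kernel below is illustrative, not taken from the issue): if its source is pasted verbatim into a `'''...'''` block, the docstring's quotes terminate the block early.
```python
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n_elements, BLOCK: tl.constexpr):
    """A user docstring -- these triple quotes are what used to break the generated wrapper."""
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x + 1, mask=mask)
```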
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks
We rename ~the first usage to `_torchdynamo_orig_fn`~ and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.
UPDATE: it seems both internal and OSS users depend on `_torchdynamo_orig_callable`, but only in the first context. We therefore keep the original name for the first case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
Summary:
This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom behavior for the respective collectives.
This allows users to control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see docstring in `Comm` in `_fsdp_api.py`
Related PR:
https://github.com/pytorch/pytorch/pull/150564
Test Plan: CI
Differential Revision: D75714362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
Summary: For priors like layer norm, the order of the weight quantization kernel might be different and therefore have a different suffix, so we use a regular expression instead.
Test Plan:
Trying this on model id 737772166 with
```
buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}"
```
will allow more linears to be correctly replaced with fp8.
An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.
Rollback Plan:
Differential Revision: D76092551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722
Approved by: https://github.com/Skylion007
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does exactly that.
In this PR I only handle default contiguity; a follow-up will handle other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
Everything should go through generalized kernels, and Metal kernels should work with the same sizes and strides as the CPU or CUDA backends, to avoid problems with `torch.compile`, which relies on the meta kernels to tell what the output is going to look like.
To avoid returning tensors with different layouts depending on whether the upper parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as a template argument and call the appropriate templates from the host.
TODOs:
- Rename upper parameter to something more sensible and add comments
- Use simd_groupsize instead of hardcoded 32 everywhere
Fixes https://github.com/pytorch/pytorch/issues/156658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
Currently, every time we construct a GLOBAL_STATE guard, we always create a fresh guard based on the current global state. For precompile, we want to create a GLOBAL_STATE guard always based on some external sources, e.g. serialized global states. This can also be applied with the normal case where we just pass in the global state guard from Python.
Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
adding more information to the error message for debugging.
example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:
TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```
Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
Fixes #149280. Follow-up to #147966, but now available for ROCm.
Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream that has a CU mask applied. We pass this masked stream to hipblaslt instead of pytorch's current stream. We ensure stream ordering between streams using hipEvents and stream synchronization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The biggest bottleneck that we found with two-shot allreduce was that the compiler was serializing all the load operations for some reason. To avoid these load delays, we've added de-serialization of loads. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:

The rows that are green are the tensor sizes that we are interested in, because two-shot is only used for bigger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers were consistently underperforming relative to the fbgemm numbers. However, with this deserialize change, most of the green tensor sizes see a performance boost (positive %). There's one tensor with negative performance, but that's within the error margin.
co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now.
Closes #150461.
The current procedure for generating the source tarball consists in creation of a source tree by manual copying and pruning of source files.
This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.
The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.
## Issues
### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels is required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.
### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.
PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.
The list of symbolic links currently is as follows:
<details>
|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` | ✅ ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |❗`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |❗`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` | ✅ ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |❗`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |❗`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |❗`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |❗`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` |✅ ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` | ✅ ok (individual file)||
</details>
The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.
[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.
[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.
[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.
[^4]: We don't need to ship the rust bindings for ittapi.
[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.
### Nccl
Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download.
However, it *is* possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
This PR improves the parallelize_module API to support more corner cases:
1. if the plan entry is specified as "", it should apply the style to the current module
2. if the plan entry does not have a corresponding submodule to apply to, raise a warning and ignore this plan entry
While working on this PR, I also found that the while-loop inside is actually not necessary and could produce some nasty modify-while-iterating behavior, so I removed it. A minimal sketch of the two corner cases is shown below.
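Minimal sketch of both corner cases (assumes an initialized distributed environment; the device mesh shape below is arbitrary):
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel

mesh = init_device_mesh("cuda", (8,))
model = nn.Sequential(nn.Linear(16, 16))

# Case 1: an empty-string key applies the style to the passed-in module itself.
parallelize_module(model[0], mesh, {"": ColwiseParallel()})

# Case 2: a key with no matching submodule now raises a warning and is ignored.
parallelize_module(model, mesh, {"no_such_child": ColwiseParallel()})
```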
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157182
Approved by: https://github.com/tianyu-l
The PyTorch build is failing on Power systems starting from commit ec24f8f58a74502c5a2488f5d9e85a817616dda0
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power, and I am able to build successfully.
Hi @malfet @albanD,
please review this PR for the PyTorch build issue that we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict: all the input arguments must have the same device mesh if they are DTensors. But many times a user might want to pass DTensors that live on different device meshes to this function, e.g. weight and activation could live on different device meshes.
When using local_map, we are extracting the local tensors from DTensors, and as long as the placements the user specified match the actual DTensor placements, the user clearly knows that the inputs are intended to live on different meshes. So this PR removes the same-mesh check and updates the doc to clearly document the behavior.
The `device_mesh` argument now serves one main purpose: allowing the user to specify the device_mesh for the output DTensor reconstruction.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157049
Approved by: https://github.com/Chillee, https://github.com/zpcore
With Triton main things were failing with:
```py
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
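Putting the pieces above together, a hedged usage sketch of the new protocol (round-tripping a tensor; `max_version` is the keyword added here):
```python
import torch

x = torch.arange(6, dtype=torch.float32)
capsule = x.__dlpack__(max_version=(1, 0))  # producer emits a versioned capsule
y = torch.from_dlpack(capsule)              # consumer handles both versioned and legacy capsules
assert torch.equal(x, y)
```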
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Summary: Since we increment the counter after performing the callback, it leads to an assertion error when the callback raises an error and the increment never happens. Let's increment first to avoid this.
Test Plan:
tba
Rollback Plan:
Differential Revision: D77475650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a shadow, which is treated as an error for certain embedded compile flows.
Simple rename to avoid.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77446104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
Summary: Pretty simple. If a planner exists, which implies that planning is enabled, create a manager for each frame. The associated serial executor will use the withMemoryPlannner fn to ensure the deallocation is done after execution completes.
Test Plan: CI
Differential Revision: D73635809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
Summary: Fixes a gap in the Triton update where the traverse would break because `get_tma_stores` didn't handle both TMA APIs.
Test Plan:
`buck test -m ovr_config//triton:beta 'fbcode//mode/dev-nosan' fbcode//ads_mkl/ops/tests:gdpa_dcpp_test -- --exact 'ads_mkl/ops/tests:gdpa_dcpp_test - test_gdpa_dcpp (ads_mkl.ops.tests.gdpa_dcpp_test.GdpaDCPPTest)'`
Rollback Plan:
Differential Revision: D77501582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157212
Approved by: https://github.com/davidberard98
Fixes https://github.com/pytorch/pytorch/issues/155046
This change allows Cholesky inversion to use rocSOLVER. This is now also the default on ROCm for Cholesky inversion which aligns with the behavior on NVIDIA (which defaults to cuSOLVER for this linear algebra operation). This fix also gets around a memory access fault encountered in MAGMA for large matrices.
MAGMA can still be forced on ROCm by doing:
```
torch.backends.cuda.preferred_linalg_library(backend='magma')
```
Ran all Cholesky UT on ROCm and there were no regressions.
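A hedged usage sketch (the routing change is transparent to user code):
```python
import torch

A = torch.randn(512, 512, device="cuda")
A = A @ A.mT + 512 * torch.eye(512, device="cuda")  # make the matrix symmetric positive definite
L = torch.linalg.cholesky(A)
A_inv = torch.cholesky_inverse(L)  # on ROCm this now dispatches to rocSOLVER by default
```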
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157154
Approved by: https://github.com/jeffdaily
Creating contiguous strides creates an expression max(1, x). Often we know that x >= 1, in
which case we should simplify max(1, x) to x.
This appeared in two situations:
1) An internal user complained about statically_known_true(x == max(1, x)) failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290).
https://github.com/pytorch/pytorch/pull/155938 won't be needed with this change.
2) Not simplifying the above could result in wrong ConstraintViolationErrors,
because we assume non-trivial single-arg guards shall evaporate; see the logic in the function
issue_guard in symbolic_shapes.py.
With this change we no longer throw ConstraintViolationErrors with the program below.
This was blocking [this PR](https://github.com/pytorch/pytorch/pull/155590) from landing
internally, due to internal export tests throwing ConstraintViolationErrors
like
```
Constraints violated (width)!
- Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
```
```
import torch

x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)

@torch.compile(fullgraph=True, dynamic=True)
def func(x):
    if max(1, (-1 + x.size()[0] // 2)) == (-1 + x.size()[0] // 2):
        return x * 400
    else:
        return (x * 10) * 100

func(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157189
Approved by: https://github.com/pianpwk
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.
- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
- size_t getCUDABlasLtWorkspaceSize()
- void* getCUDABlasLtWorkspace()
Fixes https://github.com/ROCm/pytorch/issues/2286.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156495
Approved by: https://github.com/eqy
Really, pytorch shouldn't be messing with basic _global_ cmake configuration like this, but without a careful analysis of what all depends on this behaviour, I'm not confident proposing a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
`index_put` with a boolean mask (`target[mask] = src`) causes a `cudaStreamSynchronize`. When both `mask` and `target` tensors are on GPU this is expected.
However, the sync can be prevented if the `mask` is a CPU tensor.
Internally a new index tensor is created with `mask.nonzero()` so we can use a non-blocking copy to transfer it to the GPU since it cannot be accidentally mutated by the user between its creation and the device copy. @ngimel Let me know if I'm missing something.
I think this is useful since users can't prevent a sync simply by making sure all tensors are on the same device, as they can with other ops. Instead one would need to do something like this, which is much less readable:
```python
indices = mask.nonzero().squeeze(1).to("cuda", non_blocking=True)
target[indices] = src
```
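With this change, the straightforward form below should no longer force a synchronization when the mask lives on the CPU (hedged illustration; the tensors are made up):
```python
import torch

target = torch.zeros(1024, device="cuda")
mask = torch.rand(1024) > 0.5                      # boolean mask kept on the CPU
src = torch.randn(int(mask.sum()), device="cuda")  # one value per selected element
target[mask] = src                                 # previously triggered a cudaStreamSynchronize
```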
Fixes #12461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156384
Approved by: https://github.com/ngimel
We noticed model code that contains indexing syntax like the [nanogpt model code](f144fe9095/torchbenchmark/models/nanogpt/model.py (L240)), which causes training to fail in the backward pass when using DTensor.
In the code, `x = x[:, [-1], :]` calls the index op, and in the backward pass it will trigger `aten.index_put.default` with the second argument of type `torch::List<std::optional<Tensor>>`, e.g. `[None, tensor([-1], device='cuda:0')]`. We are unable to unwrap the op info into DTensor based on the current logic [here](2625c70aec/torch/distributed/tensor/_dispatch.py (L339-L358)). We need to set runtime_schema_info for the op and enable needs_pytree to support the conversion of the tensor list arg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156240
Approved by: https://github.com/wanchaol
Summary:
With the way these were written, any string literals that were being passed in, like `__func__`, were only ever passed down as a `const char*`, so this switches it over to take a `std::string_view` at the deepest part.
This also has the side effect of allowing `std::string_view` to be passed to the `RECORD_FUNCTION` macros as well.
Test Plan:
contbuilds
Rollback Plan:
Differential Revision: D74681042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153567
Approved by: https://github.com/Skylion007, https://github.com/swolchok
Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high pri to add it back to ensure foreach ops are actually performant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871
Approved by: https://github.com/albanD
ghstack dependencies: #156876
Fixed error message:
On main:
```
KeyError: ("Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. ", 'Found mesh dim indices to slice: [(1,), (1,)]. ', 'Mesh dim indices should be in ascending order.')
```
On PR:
```
KeyError: Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. Found mesh dim indices to slice: [(1,), (1,)]. Mesh dim indices should be in ascending order.'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157096
Approved by: https://github.com/Skylion007
This PR rewrites how load balancing and sharding work in the current
context parallel implementation.
Why the changes? We should NOT expose another layer of "sharding"
concept as it would confuse the user about its difference with DTensor
sharding. The current CP perform sharding weirdly simply because it
mixed the concept of load balancing and sharding.
I think load balancing and sharding need to be decoupled to separate
layers:
* The load balancing layer is responsible for reordering the input sequence
so that the attention computation is evenly balanced across rows/ranks.
* Sharding is a separate layer after it; it simply takes the input reordered by
the load balancer and shards it exactly as DTensor shards a tensor sequentially.
In this PR:
* I removed the "Sharder" and "LoadBalancer" mixed usage, and
simply generate a roundrobin indices when the mask is a casual mask
* use `distribute_tensor` to perform the sharding. We still keep the local
shard instead of the DTensor objects to allow maximum compatibility with
arbitrary model architecture given DTensor op coverage is not high
enough.
One alternative design is to still keep the LoadBalancer and add the indices
generation and restore to be the protocol of the LoadBalancer. I thought through
it and think we might want to directly expose the load_balancing indices as
an argument instead of a dedicated class interface, so I removed it here. More
discussion on this is welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155442
Approved by: https://github.com/XilunWu
ghstack dependencies: #155441
As titled, I'm working on a series of changes to make the ring attention
impl and DTensor work better together; this PR specifically refactors the
current implementation to:
* remove dead/unused code
* restructure the functions to keep them organized
* improve or remove error messages
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155441
Approved by: https://github.com/fegin
Differential Revision: D77249427
Due to memoization and graph order updates, it can happen that a backed symbol is passed into compute_unbacked_bindings and leads to a failure. An example is as follows:
- There are 2 boolean indexing operators (e.g. op1 and op2) with the same mask.
- An unbacked symint is generated from op1, and then op2 reuses the unbacked symint due to a nonzero_memo in nonzero's fake implementation, so no rebinding is needed for op2.
- Since op1 generated the unbacked symint, its meta has "unbacked_bindings" field filled and op2's meta doesn't have it.
- Outputs from op1 and op2 are later concatenated with others that have backed symints, so that the unbacked symint can be replaced by a backed symint.
- In Inductor, during fake tensor prop, there is no memoization because a new fake tensor is always generated (for the same node). op1 generates an unbacked symint, and it can be rebound successfully to the backed symint. Since there is no memoization, op2 also generates a new unbacked symint, but no rebinding can happen because op2's meta doesn't have "unbacked_bindings". And "compute_unbacked_bindings/_rename_unbacked_to" fails when asserting that op2's old symbol is unbacked.
From discussion with [@ezyang](https://www.internalfb.com/intern/profile/?id=503862770), there is no easy way to fix this issue.
- We can try to enable memoization for fake tensor prop in Inductor; however, we need to ensure that op1 is visited before op2 during Inductor fake tensor prop for this to work (op2's meta doesn't have "unbacked_bindings", so no rebinding can happen there and we need to do the rebinding from op1), but there are passes such as reorder_for_locality that can change the graph order, so this doesn't work.
- A simple hack is to just replace the unbacked symbol in op2 by the backed symbol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156911
Approved by: https://github.com/ezyang
Summary:
Add MTIA_INSIGHT to kMtiaTypes in kineto_shim.cpp
For Insight, the user can use MTIA_INSIGHT_VERBOSE_TRACES=0 to disable the profiler, so we can enable it by default.
Test Plan:
{F1979756361}
When the environment var isn't set, it uses 0.
Rollback Plan:
Differential Revision: D77315882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156853
Approved by: https://github.com/sraikund16
This PR adds a new config `backward_pass_autocast`, to set the backward autocast
behavior. It does not change the existing behavior.
The reason why we need this is that torch.compile acquires a forward and
backward graph at the time of the forward pass. This means that
implemented naively, if there are any context managers active outside
the call to torch.compile, the backward graph will also get the
behaviors from those context managers. This PR gives users a way to
tweak the autocast behavior of the backward pass.
Please see torch._functorch.config for the options to the
`backward_pass_autocast` config.
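A small illustration of why the forward-time capture matters (hypothetical example, not from the PR):
```python
import torch

@torch.compile
def f(x, w):
    return (x @ w).sum()

x = torch.randn(8, 8, device="cuda", requires_grad=True)
w = torch.randn(8, 8, device="cuda", requires_grad=True)

with torch.autocast("cuda"):
    out = f(x, w)  # both the forward and backward graphs are captured inside this context

out.backward()     # the backward itself later runs outside the autocast context
```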
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356
Approved by: https://github.com/bdhirsh
ghstack dependencies: #155354
The dtype documentation has not been updated in awhile, let's do a revamp.
1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst`
2. split the dtype table into floating point and integer dtypes
3. add the definition of shell dtype
4. add the float8 and MX dtypes as shell dtypes to the dtype table
5. remove legacy quantized dtypes from the table
6. add the definition of various dtype suffixes ("fn", etc)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087
Approved by: https://github.com/albanD
Fixes #136662
There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
These new nodes cause the alias info on the graph to be wrong. To fix
this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
nodes that it thinks have "equivalent" FakeTensor metadata.
It should not be allowed to do this if any users of the node can
alias the node. The example
is if we have `x = foo(...); y = x.view(...)`. If the user replaces
`foo` with a new `bar` node and sets bar.meta["val"] correctly, then
FakeTensorUpdater still needs to update y's meta["val"] to be a view
of the new bar node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does exactly that.
In this PR I only handle default contiguity; a follow-up will handle other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
## Summary
Moved the Triton-specific NVSHMEM tests in `test_nvshmem.py` into a dedicated `test_nvshmem_triton.py` file. Also put the shared Triton JIT kernels at the top-level of new file for reusability.
## Testing
```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem_triton.py
```
All 16 original tests pass with no functionality changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156685
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156684
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
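For reference, this is (roughly) the user-side allocator hook that eager Triton code relies on for the new TMA API, and which AOTI now handles on its own; the sketch assumes Triton's `triton.set_allocator` entry point:
```python
import torch
import triton

def alloc_fn(size: int, alignment: int, stream):
    # Scratch buffer into which tl.make_tensor_descriptor writes TMA descriptors.
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(alloc_fn)
```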
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
**Summary**
Fix the performance regression of `functorch_maml_omniglot` in TorchBench. The issue reported in [#151523](https://github.com/pytorch/pytorch/issues/151523) occurs only when a parallel reduction is performed under the vectorized loop and a scalar kernel is used for the tail loop. Previously, we addressed this regression in [#151887](https://github.com/pytorch/pytorch/pull/151887) by disabling all cases where a parallel reduction occurs under the vectorized loop. However, for `functorch_maml_omniglot`, we found that a masked vector kernel is used in the tail loop instead of the scalar kernel in the job of `inductor_torchbench_cpu_smoketest_perf`. In this PR, we refine the fix by excluding the cases where a masked vector kernel is used in the tail loop, rather than disabling all such scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156526
Approved by: https://github.com/CaoE
Fixes #150951
Summary:
For complex.pow(2) on GPU:
Uses complex * complex directly.
Produces results consistent with CPU implementation.
Eliminates spurious imaginary components for real inputs.
🧪 Tests
Added unit tests to verify correctness of the new kernel path.
Verified numerical consistency with CPU results.
This change is backward-compatible and only affects the specific case of pow(2) on complex tensors on GPU.
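A quick check mirroring the description above (hedged sketch):
```python
import torch

x = torch.tensor([3.0 + 0j, 1.0 + 2j])
print(x.cuda().pow(2))  # now computed as complex * complex on the GPU
print(x.pow(2))         # matches the CPU result; no spurious imaginary part for real inputs
```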
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152373
Approved by: https://github.com/ezyang
Calling `at::native::nansum_out` causes the fake kernel to dispatch to a
`make_reduction` call and then segfaults later due to the
`mutable_data_ptr` call in `TensorIteratorBase::build`. It also causes
fake tensor propagation issue in Dynamo. The added tests demonstrate the
aforementioned 2 issues.
This patch fixes it by dispatching to `at::nansum_out` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156642
Approved by: https://github.com/zou3519
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks
we believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected
changing the default to False in fbcode, guarded by a jk so that we can quickly revert to the original behavior if needed
Test Plan:
CI
Rollback Plan:
Differential Revision: D77265930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks
We rename the first usage to `_torchdynamo_orig_fn` and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #156527
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762, #155166
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Today the only way to choose allocation backend is via env `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on test file basis. (The env has to be set before program is loaded).
This PR added a programmatic way -- a `set_backend` API.
Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself as **the** allocator directly. Later when `set_backend` is called, the allocator would actually fill in the device-to-allocation `map_`.
Though added, `set_backend` is **not** a necessary API for user to call -- one backend is still registered as the default at static time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
Summary:
Encountered a 'register_forward_pre_hook not supported on ScriptModule' error when trying to publish CFR MTML with the remote_ro module placed in remote. The issue may come from the fact that the local net from torchArrow is already a ScriptModule before the gen_app_graph pass.
{F1979770267}
Test Plan:
hg checkout 1ff14dfaade4ac1f3cbbf38fbd72f7fdd5cdcd16
bash hstu_blocker.sh
Rollback Plan:
Reviewed By: RenfeiChen-FB
Differential Revision: D77341370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156904
Approved by: https://github.com/jingsh
## Summary
This PR removes unnecessary `dist.barrier` calls up in our Triton NVSHMEM test suite and adds signal_op support, which is a lightweight device-side signaling mechanism. Added test for this in our `wait_until` kernel and corresponding `core.extern` wrapper.
**Why did we drop the `dist.barrier()` calls?**
We dropped the host‐side dist.barrier() in all Triton NVSHMEM tests (except the raw put/get cases) because every other test already uses NVSHMEM collectives or device‐side sync primitives (fence/quiet/signal/wait), making the extra barrier redundant. This keeps synchronization entirely on the GPU and leverages NVSHMEM’s native ordering guarantees for clearer, more efficient tests.
**`test_triton_wait_until` update**
- **Rank 1**: after `put_kernel` writes the data, launches `signal_op_kernel` to atomically set Rank 0's flag via `nvshmemx_signal_op`
- **Rank 0**: drops its old `dist.barrier()` and simply calls `wait_until_kernel` to spin-wait on the device flag, then asserts data correctness
- Changes made per [this comment](https://github.com/pytorch/pytorch/pull/156472#discussion_r2159734046)
## Testing
```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156684
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
unbind will always specialize on dim, because it determines the number of output tensors.
guard_size_oblivious is not useful there and is probably more confusing for code readers.
Added a comment and a test that verifies the specialization.
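A small sketch of why the specialization is unavoidable (illustrative only):
```python
import torch

@torch.compile(fullgraph=True, dynamic=True)
def f(t):
    # The number of returned tensors equals t.size(0), so dim 0 must be specialized.
    return torch.unbind(t, dim=0)

f(torch.randn(4, 8))  # compiles with size(0) == 4 baked in
f(torch.randn(5, 8))  # a different size(0) triggers a recompile, by design
```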
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148815
Approved by: https://github.com/pianpwk
Summary:
Sample error message:
```
RuntimeError: Failed to find a generated cpp file or so file for model 'forward' in the zip archive.
Available models in the archive:
model
To load a specific model, please provide its name using the `model_name` parameter when calling AOTIModelPackageLoader() or torch._inductor.package.load_package.
The following files were loaded from the archive:
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/cqdxv6zki2oiiytjeqrg774uxlxgqdemhdxn5dycn4nnc3rmcd7w.cubin
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.so
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_format
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/byteorder
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/serialization_id
```
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_loading_wrong_model"
```
Rollback Plan:
Differential Revision: D77320485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156863
Approved by: https://github.com/tugsbayasgalan
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation for the same input in forward AND in backward.
AOTD emitted copy_ after joint_function tracing. This made this fx node correspond to the side effects of both mutations (in forward and in backward).
After that, the partitioner can put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle that allows setting a "post_forward" callback, to be able to check the inputs' state after forward.
We do not want to apply the mutation after the joint graph if we already applied it in forward. For that we need a "mutation_counter" and to memorize the version of the mutation that we applied for the forward mutation.
2/ Exposing mutation_counter to python
We want to keep the invariant that copy_ exists only at the end of the joint graph.
3/ We memorize mutation_counter and state of the inputs after forward, using the handle post_forward.
Emit post_forward mutations after the joint graph is fully traced.
Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this set MUST_SAVE for the source of mutation in forward.
proxy_tensor changes:
By default proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want that this copy_ will be independent and applied just to primals.
For this introducing a contextmanager to be able to disable update of tensor_tracker for adding forward mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Fixes#136662
There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
These new nodes cause the alias info on the graph to be wrong. To fix
this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
nodes that it thinks have "equivalent" FakeTensor metadata.
It should not be allowed to do this if any users of the node can
alias the node. For example, suppose
we have `x = foo(...); y = x.view(...)`. If the user replaces
`foo` with a new `bar` node and sets bar.meta["val"] correctly, then
FakeTensorUpdater still needs to update y's meta["val"] to be a view
of the new bar node.
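A small sketch of the aliasing situation described above, using `make_fx` in fake-tracing mode; the function and node names are only illustrative:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(a):
    x = a + 1          # stand-in for foo(...)
    y = x.view(2, 8)   # y's FakeTensor in meta["val"] is a view of x's
    return y

gm = make_fx(f, tracing_mode="fake")(torch.zeros(4, 4))
for node in gm.graph.nodes:
    # replacing x's meta["val"] requires refreshing y's as well, since y aliases x
    print(node.name, node.meta.get("val"))
```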
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" for representing fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.
### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
- The names are more informative: 'tf32' says more than a plain "high".
- Easier to extend to new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision of the different algorithms; however, this can be addressed with additional documentation.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### The following fp32 compute precision settings can be set:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set and can be overridden by the parent node.
### Overriding Precision Settings
A child node can be overridden by its parent node if it is left at the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
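A short sketch of the override rule described above (illustrative only; it assumes the new per-backend/per-op `fp32_precision` attributes from this PR and that reading a child left at "none" reports the inherited value):
```python
import torch

# Setting the mkldnn backend precision propagates to child ops that are
# still at the default ("none"), per the override rule described above.
torch.backends.mkldnn.fp32_precision = "bf16"
print(torch.backends.mkldnn.matmul.fp32_precision)  # expected: "bf16"
print(torch.backends.mkldnn.conv.fp32_precision)    # expected: "bf16"
print(torch.backends.mkldnn.rnn.fp32_precision)     # expected: "bf16"

# Setting the generic backend propagates to mkldnn and its children as well.
torch.backends.fp32_precision = "bf16"
```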
### Backward Compatibility
Since the new API allows more fine-grained control, there can be conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the status `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they will work as before.
- If the user uses the **new** API to change the status to one that is **un-representable** by the old API, and then tries to access the status via the **old** API, we raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
We discovered that the latest CUDA 12.9 ARM nightly wheel is missing the NCCL library at import time. With USE_SYSTEM_NCCL=1, we need to copy libnccl.so into our big-wheel environment so that it can be dynamically linked at runtime.
https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which uses the system NCCL by default instead of the one built into libtorch_cuda.so. With this PR, we add libnccl.so back so it can be used at runtime. This also provides the flexibility to use a different NCCL version from the one that came with the original PyTorch build.
related - https://github.com/pytorch/pytorch/issues/144768
```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
Fixes https://github.com/pytorch/pytorch/issues/156815
As far as testing goes:
* I tried to use cuobjdump, but that was kind of awkward (bccd9393a5); the problem was that the name of the cubin always contains a single gencode.
* Another idea was to read stderr and check that the right number of gencodes is there (0beadc01b3). This helped a lot to convince me locally that the test works; it passed on my dev GPU but failed in CI, and I suspect a bad interaction with subprocesses.
* The last approach was a simpler unit test that checks which flags get added by default. This is not as comprehensive as the previous ideas, but it works and is fast, so I opted for it, since my own experiments and customer reports convinced me the testing works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850
Approved by: https://github.com/malfet
This adds the `dist_info` command to the list of non-building commands of `setup.py`, which avoids the current situation where simple metadata generation with any packaging tool already triggers a build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156709
Approved by: https://github.com/Skylion007
----
- serialization
- dlpack
**Next Steps**:
- The rest of `test/test_cpp_extensions_open_device_registration.py` is about the fallback mechanism. In order to keep it consistent with other accelerator usage (C++ registration), the implementation of OpenReg needs to be refactored:
* Simulate multiple device memory in a single process (a brief RFC will be submitted this week)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156589
Approved by: https://github.com/albanD
ghstack dependencies: #156588
Changes WITH_PUSH and the environment check to allow granting credentials to push to docker.io if the run is on the main branch, a tag starting with v, or the release branch.
Credentials for pushing to docker.io are in the environment, so without the environment you can't push to docker.io. The push also only happens if WITH_PUSH is true.
Binary builds on the release branch were failing because they pull from docker.io, but the docker build wasn't pushing to docker.io: it was either on the release branch (no credentials, https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or on the tag (no WITH_PUSH).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
Fixes#154328
**Summary**
Failure reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.
Fix:
Clamp `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
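A small Python sketch of the clamping idea (illustrative only; the actual fix lives in the C++ fake-quantize kernel, and the parameter values below are arbitrary):
```python
import torch

def fake_quant_ref(x, inv_scale, zero_point, quant_min, quant_max):
    # Clamp before the integer conversion so +/-inf (or any out-of-range float)
    # maps to quant_min/quant_max instead of hitting undefined float->int64 behavior.
    q = torch.clamp(x * inv_scale + zero_point, quant_min, quant_max).round()
    return (q - zero_point) / inv_scale

x = torch.tensor([float("inf"), float("-inf"), 0.5])
print(fake_quant_ref(x, inv_scale=10.0, zero_point=0, quant_min=-128, quant_max=127))
```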
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Summary:
- Consolidate the stack trace recording code in TracerBase and PythonKeyTracer
- Change `make_fx`'s arg name to be consistent with TracerBase member name `record_stack_traces`
We move the stack trace logic from `create_proxy` to `create_node` so that all classes inheriting from TracerBase re-use the same stack trace logic.
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
```
Rollback Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156257
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary: Undo the high-level BUCKification in favor of something more organized by moving it to the directory itself.
Test Plan:
CI
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D76920013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.
If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately calls make_launchers() on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.
One way to resolve the bug is just to add whether precompile is enabled to the FxGraph cache key. The better fix, however, is higher level/philosophical:
When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on.
BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.
Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
New set of batchnorm tests to verify NCHW 2D/3D BatchNorm.
This test also allows adding and configuring different BatchNorm tests (dtypes, NCHW/NHWC, mixed) in the future,
based on:
- Train [test_batchnorm_cudnn_nhwc](1051b93192/test/test_nn.py (L4985))
- Inference [test_batchnorm_nhwc_cuda](1051b93192/test/test_nn.py (L5130))
```
test_batchnorm_3D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_float32) ... ok (0.113s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.057s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.063s)
test_batchnorm_3D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_float32) ... ok (0.059s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.006s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16) ... ok (0.006s)
test_batchnorm_3D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16) ... skip: 3D float16 NCHW train failed on ROCm<7.0 (0.001s)
test_batchnorm_2D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_float32) ... ok (0.016s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_float32) ... ok (0.054s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.002s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16) ... ok (0.001s)
test_batchnorm_2D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_float16) ... ok (0.002s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156498
Approved by: https://github.com/jeffdaily
Summary:
ncclUniqueID is only relevant when a comm is created using ncclCommCreate or ncclCommCreateConfig. If a comm is created with ncclCommSplit, this field is unset, causing its usage to create unexpected behavior.
This patch creates a unique hash key for each comm, irrespective of how the comm is created.
Test Plan:
CI
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156790
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
This PR adds the `@deprecate` decorator for internal functions that we are prepping for deprecation. Add it on top of an internal function to emit a deprecation warning and allow backward compatibility with the non-internal version of the function.
Tested with `python test/test_utils.py TestDeprecate.test_deprecated`.
Furthermore, testing with a modified version of the test in the PR gives something like this, which is what we want:
```
/home/sahanp/repos/pytorch/test/test_utils.py:1239: UserWarning: deprecated_api is DEPRECATED, please consider using an alternative API(s).
deprecated_api(1, 2)
```
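For illustration, a minimal sketch of what such a decorator can look like (a generic example, not the actual decorator added by this PR; the name `deprecate_stub` is hypothetical):
```python
import functools
import warnings

def deprecate_stub(func):
    """Hypothetical stand-in: warn that the wrapped internal function is deprecated."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is DEPRECATED, please consider using an alternative API(s).",
            UserWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)
    return wrapper

@deprecate_stub
def deprecated_api(x, y):
    return x + y

deprecated_api(1, 2)  # emits a UserWarning like the one shown above
```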
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155127
Approved by: https://github.com/albanD
Co-authored-by: albanD <desmaison.alban@gmail.com>
**Summary**
Enable fp8 qlinear on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support to the existing int8 qlinear op. It does not add a new op, nor does it affect the frontend or the quantization flow. The schema of the qlinear op is not changed either.
So, the FP8 qlinear shares the same op as the INT8 qlinear, and the difference is that the src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
The differences between qlinear and `_scaled_mm` are:
- Qlinear supports post op fusion while `_scaled_mm` does not
- Weights are prepacked for qlinear
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qlinear and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155678
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Summary: The wrapper tensor for DTensor loses its shape in offload_tensor. This PR fixes that bug.
Test Plan:
updated the test. Test fails with old code and passes with the fix.
Rollback Plan:
Differential Revision: D77269733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156774
Approved by: https://github.com/mikaylagawarecki
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix#154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
When the async compile subprocesses crash in C++ they tend to just silently die instead of leaving any kind of trace. This installs a crash handler so that if they SEGV, ILL, or ABRT they'll attempt to output a backtrace instead.
While in there I also cleaned up the CLANGTIDY warnings coming from Module.cpp.
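As a rough Python-level analogue of the idea (the PR itself installs the handler in C++), the standard `faulthandler` module can be enabled in a worker process to dump a traceback on fatal signals:
```python
import faulthandler
import sys

# Dump the Python traceback of all threads to stderr if the process receives
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL, instead of dying silently.
faulthandler.enable(file=sys.stderr, all_threads=True)
```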
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155068
Approved by: https://github.com/masnesral
# Motivation
Update the doc so that `torch.device`'s constructor officially supports the following methods (see the example after this list):
- A device string, which is a string representation of the device type and optionally the device ordinal.
- A device type and a device ordinal.
- A device ordinal, which is treated as the current accelerator type.
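For reference, the three forms listed above correspond to calls like the following (the device type and ordinal are just examples):
```python
import torch

d1 = torch.device("cuda:0")   # device string: type plus optional ordinal
d2 = torch.device("cuda", 0)  # device type and device ordinal
d3 = torch.device(0)          # bare ordinal, treated as the current accelerator type
```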
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686
Approved by: https://github.com/albanD
Implementations:
1. Move collective ops to c10d namespace, so that we can call them externally.
2. Add AOTI shims for collective ops.
Testing
1. Add c10d functional UT for cpu.
2. Include the above one in cpp wrapper UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492
Approved by: https://github.com/desertfire
Summary:
Adds an experimental implementation of a rank-local checkpointer with save and load, supporting partial load, blind load, and in-place load.
This uses a new API and a simpler format.
Async checkpointing, an IO layer, pluggable storage backends, layout customization, resharding, deduplication, etc. are planned but not yet implemented.
Test Plan: unit tests
Reviewed By: saumishr
Differential Revision: D75426560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156142
Approved by: https://github.com/saumishr
The changes are documentation changes to the function lobpcg. There are three changes to the doc:
1. Match the doc arg descriptions to be in the same order as the function's parameters.
2. Update the documentation for arg `n` to indicate that when arg `x` is specified, the value of `n` is ignored if set.
3. Add warning that `m` must be bigger than 3 x the number of requested eigenpairs.
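A small usage sketch reflecting the documented constraint (sizes here are arbitrary; per the warning above, the matrix dimension `m` must exceed 3 times the number of requested eigenpairs):
```python
import torch

k = 2                           # number of requested eigenpairs
m = 10                          # matrix size; must satisfy m > 3 * k
A = torch.randn(m, m)
A = A @ A.T + m * torch.eye(m)  # make A symmetric positive definite
eigenvalues, eigenvectors = torch.lobpcg(A, k=k)
print(eigenvalues.shape, eigenvectors.shape)  # torch.Size([2]) torch.Size([10, 2])
```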
Fixes#152107
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156139
Approved by: https://github.com/soulitzer
Differential Revision: D76904773
In the current scheduler logic, if a template buffer is only a Triton template, which can result from only 1 Triton choice in the autotuning, the fusion won't be benchmarked.
This can lead to an edge case in which a Triton GEMM template from the autotune lookup table can have a problematic fusion, leading to shared memory requirements above the hardware limit. `(256, 128, 64, 4, 8, 8)` is such a config, where we have seen fusion with a `.to(torch.float32)` can lead to this issue, `out of resource: shared memory, Required: 264224, Hardware limit: 232448`. We benchmark the fusion for this case to ensure it's safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156500
Approved by: https://github.com/jansel
Fixes#156261
Thanks to @ngimel's fast eyes
For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
Summary:
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of wrong usages that I will fix in the next PR.
Test Plan:
OSS and cont
Rollback Plan:
Differential Revision: D77054177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156518
Approved by: https://github.com/bobrenjc93
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
#### 2. Configure the Base Docker Image
1. **Add new Base Docker image configuration** (if applicable):
Add the configuration in `.ci/docker/build.sh`.
The process involves compiling thousands of files, and would take a long time. Fortunately, the compiled objects can be useful for your next build. When you modify some files, you only need to compile the changed files the next time.