Fix #156261 _foreach_copy indexing (#156719)
Fixes #156261
Thanks to @ngimel's fast eyes
For testing, I had experimented with a broader test case change, but found that creating a tensor of size 2**31 + 1 was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with the new changes and fails without them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
(cherry picked from commit 4ee4863232b9e07728d85254768bcba3aadc9b9a)
Co-authored-by: Jane Xu <janeyx@meta.com>
docs: add get_default_backend_for_device to distributed documentation (#156783)
The `torch.distributed.get_default_backend_for_device()` API was added in torch 2.6 but is still missing from the distributed documentation. This commit addresses the gap.
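A minimal usage sketch of the documented API (the returned names depend on which backends your build includes):
```python
import torch.distributed as dist

# Query the default process-group backend for a device type.
print(dist.get_default_backend_for_device("cpu"))   # typically "gloo"
print(dist.get_default_backend_for_device("cuda"))  # typically "nccl"
```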
CC: @guangyey, @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
(cherry picked from commit b146ca74f01df3cf711fd0f855e05805e490156c)
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
don't error out in empty_cache under mempool context (#158152)
Now, instead of erroring out on an `empty_cache` call during graph capture or under a mempool context, we just silently do nothing. This was already the behavior for mempools; cudagraphs used to error out, but it's fine to just ignore the call.
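A minimal sketch of the new behavior, assuming a CUDA build where `torch.cuda.MemPool` and `torch.cuda.use_mem_pool` are available:
```python
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, device="cuda")
    # Used to raise under a mempool context (and during graph capture);
    # now it is silently a no-op.
    torch.cuda.empty_cache()
```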
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
(cherry picked from commit 9056279f8159b052599a31b591a78da1acc4224c)
Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
[autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer stream, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event; (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event when the first producer stream is the same as both the consumer stream and the accumulation stream, or when the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
(cherry picked from commit 8bda95228fbefa6ce204bf4da8b632d1516431bb)
Co-authored-by: soulitzer <soulitzer@gmail.com>
[user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
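For reference, a sketch of the user-provided allocator mentioned above for eager Triton (the allocator signature follows Triton's tensor-descriptor examples; treat it as an assumption of this note):
```python
import torch
import triton

# Device-side TMA needs scratch space for descriptor construction; in eager
# Triton the user registers an allocator, while AOTI now allocates it itself.
def alloc_fn(size: int, alignment: int, stream):
    return torch.empty(size, device="cuda", dtype=torch.int8)

triton.set_allocator(alloc_fn)
```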
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
(cherry picked from commit b6c00dfe249a7bfc1d61a322d5bc30f164353abf)
Co-authored-by: David Berard <dberard@fb.com>
* Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.
* Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.
[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
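For illustration, a hypothetical user kernel exhibiting the pattern: any kernel whose source carries a triple-quoted docstring.
```python
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, BLOCK: tl.constexpr):
    """Adds 1 in place.

    When Inductor embeds this kernel's source in a triple-quoted block,
    the docstring's quotes must be sanitized or codegen breaks.
    """
    offs = tl.arange(0, BLOCK)
    tl.store(x_ptr + offs, tl.load(x_ptr + offs) + 1)
```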
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
(cherry picked from commit 82eefaedd98b63de8a87e34275af781f8eb177e1)
Co-authored-by: David Berard <dberard@fb.com>
[PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
The PyTorch build is failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0.
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power, and I am able to build successfully.
Hi @malfet @albanD
Please review this PR for the PyTorch build issue that we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 5e18bc333144473f1f10bc8a5ba05dba7950fb8a)
Co-authored-by: Avanish Tiwari <avanish@linux.ibm.com>
[aarch64] Add back NCCL lib to cuda arm wheel (#156888)
We discovered that the latest CUDA 12.9 arm nightly wheel fails to import because it is missing the NCCL lib. With USE_SYSTEM_NCCL=1, we need to copy libnccl.so into our big wheel environment so that it can be dynamically linked at runtime.
https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which uses the system NCCL by default rather than the one built into libtorch_cuda.so. With this PR, we add back libnccl.so to be used at runtime. This also provides the flexibility to use a different NCCL version from the one the original pytorch build came with.
related - https://github.com/pytorch/pytorch/issues/144768
```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
(cherry picked from commit de45c5f673ce261e9a82c54280beeda36cff640e)
Co-authored-by: Ting Lu <tingl@nvidia.com>
[ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
(cherry picked from commit d9577df312d477e8fa5b9d7bc61fb1f2c07b8e48)
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix environment and push env var for docker image builds for binary builds (#156910)
Changes WITH_PUSH and the environment check to be OK with giving credentials to push to docker.io if it's on the main branch, a tag starting with v, or the release branch.
Credentials for pushing to docker.io are in the environment, so without the environment you can't push to docker.io. You also don't do the push unless WITH_PUSH is true.
Binary builds on the release branch were failing because they pull from docker.io, but the docker build wasn't pushing to docker.io because it was either on the release branch (didn't have credentials, https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or on the tag (doesn't have WITH_PUSH).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
(cherry picked from commit 78ee2ee90eed957aec3dc80423b108b16938a8ae)
Co-authored-by: Catherine Lee <csl@fb.com>
By leaking the resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887) at exit, as at this point the handle to the child process might no longer be valid.
Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix, as all data loader tests will hang otherwise.
- Remove the `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the PATH environment variable)
Fixes https://github.com/pytorch/pytorch/issues/153050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
Users may want CUDAGraph for certain sizes and fallback for other sizes.
As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.
Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.
This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.
Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2, 3), (4, 5), (6, 2), (7, 3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
Don't call `sum()` on a tensor that is default-constructed.
Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:
```
Traceback (most recent call last):
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
yield
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
self._callTestMethod(testMethod)
File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
method(*args, **kwargs)
File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
ln_out_cuda.backward(grad_output_cuda)
File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
torch.autograd.backward(
File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```
Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
At the end of the scope where std::async is launched, a wait is called on the returned future, which can make the code blocking; this is not expected for the monitoring thread. Instead, let's use a vector to contain the references to the futures, so no blocking happens. At the end of the loop, wait will still be called, but that is OK since all the checks or dumps have already finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
I had to create a new PR for this because of @atalman's request to temporarily revert the previous PR to restore diff train sync. Nothing has changed between this PR and the original one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
Differential Revision: D76843916
Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs with:
* Excessive register spillage
* Much larger amounts of shared memory than available on the hardware
This diff prunes out those configs to make exhaustive autotuning more viable, and adds support for exhaustive autotuning of the persistent+TMA template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:
```
AUTOTUNE mm(1152x21504, 21504x1024)
strides: [21504, 1], [1, 21504]
dtypes: torch.bfloat16, torch.bfloat16
mm 0.1167 ms 100.0%
triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```
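For reference, a minimal sketch of how exhaustive GEMM autotuning is enabled (the config name follows current inductor; treat it as an assumption of this note):
```python
import torch

# Search the full GEMM config space instead of the default curated subset.
torch._inductor.config.max_autotune_gemm_search_space = "EXHAUSTIVE"

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

a = torch.randn(1152, 21504, device="cuda", dtype=torch.bfloat16)
b = torch.randn(21504, 1024, device="cuda", dtype=torch.bfloat16)
out = mm(a, b)
```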
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311
Test Plan:
Just changes string output
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 60.993us 0.97% 60.993us 1.848us 0 B 0 B 33
...
```
Rollback Plan:
Differential Revision: D76857251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation of the same input in forward AND in backward.
AOTD emitted copy_ after joint_function tracing. This made the fx node correspond to the side effects of both mutations (in forward and in backward).
After that, the partitioner could put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle, which allows setting a "post_forward" callback, to be able to check the inputs' state after forward.
We do not want to apply the mutation after joint if we already applied it in forward. For that we need a "mutation_counter" and to memorize the version of the mutation that we applied for the forward mutation.
2/ Expose mutation_counter to Python.
We want to keep the invariant that copy_ exists only at the end of the joint graph.
3/ Memorize the mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.
Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.
proxy_tensor changes:
By default, proxy tensor updates the tensor_tracker. In that case, applied mutations will be chained.
But we want this copy_ to be independent and applied just to primals.
For this, introduce a context manager to be able to disable updates of the tensor_tracker when adding forward mutations.
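A hypothetical repro sketch of the failing pattern (names are illustrative, not from the original issue): the same input is mutated once in forward and again in backward.
```python
import torch

class MutateInFwdAndBwd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, counter):
        counter.add_(1)                  # mutation of the input in forward
        ctx.save_for_backward(counter)
        return x * counter

    @staticmethod
    def backward(ctx, grad_out):
        (counter,) = ctx.saved_tensors
        counter.add_(1)                  # same input mutated again in backward
        return grad_out * counter, None
```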
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Differential Revision: D76514984
Fix subgraph as a choice for when a symbolic shape is passed in as an expression, e.g. 256 * s0, which typically happens in the backward pass. The current logic assumes that all symbolic shapes are single inputs, i.e. a standalone s0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
When timing is enabled, the ROCR runtime used to sleep for a small amount, which ensured that the application saw the correct state. However, for perf reasons this sleep was removed, and now the state is not guaranteed to be "started". That's why I updated the test's state check to accept either "started" or "scheduled".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The existing torchbench `Makefile` installs all models from torchbench, which can easily take 30 minutes, even if a developer only wants to run one model.
This PR adds a config to install only the torchbench models we want to run.
Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"
# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"
# Install all models
make build-deps
# Install all models
make build-deps TORCHBENCH_MODELS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
In practice `bool(...)` is either constant folded by Dynamo or used for branching, so most of its emulation logic lived in `InstructionTranslator.generic_jump`.
This patch adds a dedicated `bool` handler (only for symbolic bool/int/float for now), and fixes #136075.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
**Summary**
Add a configuration option to enable a smaller dequantization buffer for the WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
Summary:
For the 16-GPU use case, NVSHMEM can drive only up to 49GB/s with 8 thread blocks per peer for the all-to-all-v use case. Increasing that to 16 thread blocks per peer is able to max out the perf.
Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=0 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip} --node_rank=1 comms.py -- master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Rollback Plan:
Differential Revision: D76937048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.
Decisions made along the way:
1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.
The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
Test Plan: CI
Differential Revision: D75497142
Co-authored-by: Benjamin Glass <bglass@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (the traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352
This PR uses torch._dynamo.utils.warn_once so that this warning only emits once, clarifies in the warning that silent incorrectness is potential (not observed), and doesn't warn for functions that come from torch.*.
As of right now, with this code change the terminal outputs:
if the code came from torch.* :
Nothing, as we shouldn't warn for functions that come from torch.*
else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', the debug logs show (the log below is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] torch.compile(foo, backend="eager")(torch.randn(4))
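For reference, a minimal sketch (reconstructed from the log above) that triggers the warning:
```python
import functools

import torch

@functools.lru_cache
def foo(x):
    # Dynamo ignores the lru_cache wrapper and traces foo directly.
    return x + 1

torch.compile(foo, backend="eager")(torch.randn(4))
```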
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
- Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
- Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]`, and all such `getitem` operations, to be immediately after `x = some_op(...)`, such as:
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in the pytorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory numbers are:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with the `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
- A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
- Final fence before flag update to ensure all data transfers complete before signaling
- Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
The compiled autograd (CA) initial trace just proxies nodes without dispatching any ops, so we should hide it from ambient TorchDispatchModes.
In terms of differences with the eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
- accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
- Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
Example:
```python
File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
You can disable compiled autograd for this operation by:
1. Relocating the unsupported autograd call outside the compiled region.
2. Wrapping the unsupported autograd call within a scope that disables compiled autograd.
3. Configuring the specific compilation unit to disable compiled autograd.
4. Globally disabling compiled autograd at the application's initialization.
```
No duplicate error messages for python side trace-time errors
```python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
- Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
- Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
- End-to-end validation of both data transfer and signal synchronization
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
### Summary
int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.
### Perf data
GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.
| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373 |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.
Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
- A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
- Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.
Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`
```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
    nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)

...
if rank == 0:
    print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
    wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
    print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
    print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
    torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
    print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")
if rank == 1:
    print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")
...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉
...
----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).
This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).
The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.
Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig
Rollback Plan:
Differential Revision: D76628834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
dynamo: Don't crash when someone tries to access a non-existent list member
Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
Since it's now actually used within async_compile.multi_kernel:
```
def multi_kernel(self, *args, **kwargs) -> Any:
    from torch._inductor.codegen.multi_kernel import MultiKernelCall

    # no need to call this in parallel since the sub-kernels are already parallel tasks
    return MultiKernelCall(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the kernel implementations to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Copied from original auto_functionalize Diff Summary D53776805:
This is a non-functional kernel implementation for auto_functionalize.
In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of the mutating inputs.
This mutates the input tensors in place, which is unsafe in general.
However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.
The proper functional implementation will:
- make a clone of the mutating input tensor
- return these new instances of tensors as AutoFunctionalizeKernel output
If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.
Test Plan: See internal for test plan
Differential Revision: D76926383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.
Change the callstack from (group_id, thread_id_in_group) to (thread_id, thread_id_in_group), which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Eigen's version is controlled by `eigen_pin.txt`, but it will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever, and GitLab can be down when GitHub is up, which has caused spurious CI failures in the past.
Remove the eigen submodule and replace it with eigen_pin.txt.
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices
### CHANGES
- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py
Replaced hard-coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.
- torch/testing/_internal/common_distributed.py
Extended common utility functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See the added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
The backward pass simply iterates over all 8 points the current point contributed to, and back-propagates to them with the respective weights.
TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point in unrolling it by hand).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
There is a memory layout mismatch between the `fft_r2c` XPU implementation and Inductor's meta deduction.
The original `fft_r2c` Inductor meta deduction for the XPU backend was aligned with CPU (fallback). This PR corrects the Inductor meta deduction and updates the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
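As a NumPy illustration of that decomposition order (not the XPU code itself), the composed result matches a full N-D real FFT:
```python
import numpy as np

x = np.random.randn(4, 6, 8)
out = np.fft.rfft(x, axis=-1)   # R2C on the last dimension first
for axis in (1, 0):             # then iterative C2C on the remaining dims
    out = np.fft.fft(out, axis=axis)
assert np.allclose(out, np.fft.rfftn(x))
```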
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`), it will now report how many workers were enqueued and dequeued (should be the same), as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).
Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph or the weights. This is introduced to decouple the state management of the top-level runtime from the kernel executions so that subgraphs from higher-order ops can be supported.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
CI
Rollback Plan:
Differential Revision: D76830436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix #154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Commit fixes AOT compilation in the SYCL C++ extension, which got accidentally dropped in aca2c99a652 (a fallback to JIT compilation had happened). Commit also fixes the override logic for default SYCL targets, allowing the flexibility to specify targets externally. Further, commit extends test coverage to cover such a case and fixes an issue in the test where consecutive tests executed the same (first) compiled extension due to name conflicts.
Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")
CC: @pengxin99, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
There are 32 H100 `linux.aws.h100` runners and they are still not fully utilized, with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours. I added 1 more shard for `TIMM` and 3 more for `TorchBench`, using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090
With this computing power, we could also run the whole suite every 4 hours now. I could run this less frequently later if I see queueing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
This was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`.
Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
The recursive function `collect_temp_source`, a closure in PyCodegen, caused a reference-cycle issue when torch.compile is used.
This issue may cause major tensors to not be freed timely, even when there are no user references to these tensors.
We saw OOM issues because of this problem in many cases, including training and inference using torch.compile.
The fix replaces the recursive implementation with an iterative one.
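The general shape of the fix, as an illustrative sketch only (`base_sources` is a hypothetical attribute, not the real PyCodegen structure):
```python
def collect_temp_source(root_source):
    # An explicit work list replaces recursion, so no closure cell keeps a
    # function <-> cell reference cycle alive across frames.
    collected, stack = [], [root_source]
    while stack:
        source = stack.pop()
        collected.append(source)
        stack.extend(getattr(source, "base_sources", ()))
    return collected
```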
Fixes#155778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/cpp/nativert:c10_kernel_test
```
Differential Revision: D76825830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name.
Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
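A hedged end-to-end sketch of the flow these bullets describe; the compile entry point, import path, and config spelling follow the text above but should be treated as assumptions:
```python
import torch
import torch._inductor
from torch._inductor.package import load_pt2  # import path assumed

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = torch.export.export(M(), (torch.randn(3),))
pt2_path = torch._inductor.aoti_compile_and_package(
    ep, inductor_configs={"aot_inductor.package_constants_on_disk": True}
)
loaded = load_pt2(pt2_path, load_weights_from_disk=True)
```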
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```
Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7
Rollback Plan:
Differential Revision: D74747190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
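A minimal sketch of the resulting producer-side negotiation, assuming the keyword-only `max_version` parameter from the DLPack 1.0 spec:
```python
import torch

t = torch.arange(4)
versioned = t.__dlpack__(max_version=(1, 0))  # DLManagedTensorVersioned capsule
legacy = t.__dlpack__()                       # no max_version -> old capsule
u = torch.from_dlpack(t)
assert torch.equal(t, u)
```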
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python. This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.
Fixes #150988. Fixes #153478.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.
Change the callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
Summary: We noticed `std::future_error: Broken promise` errors in logging, so let's disable this for now; we will investigate more.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76929722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache by first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).
The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that tensor constructors will be lifted as constants, something we will address separately.
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
Summary:
This implements staging in a way that doesn't mess up checkpointing semantics. We want to be close to torch.save/load semantics; when async checkpointing is used, it messes up shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict with a CUDA tensor in it; this is deep-cloned, causing the staging tensor to be created on GPU. This can cause OOMs and is hard to debug.
This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing storages created for staging to avoid recreating them on each checkpoint, while also being flexible enough to handle any changes: clean up old storages or create new ones as needed.
The lifetime of staging storages is tied to the original storage object. When the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify the underlying storage object has the same shape/size, etc., to make the caching logic work. The current implementation is much simpler and cleaner.
The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
with staging_context(stager):
cpu_state_dict = copy.deepcopy(state_dict)
```
Also, adds support for pinned-memory.
One problem this implementation does not address is that we lose the original device.
The only alternative here is to pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug as needed, so we would need to unpickle in the background process. I think this is flexible, but not performant and not very different from the current solution, while needing more code. One idea, if we really want to address this, is to stick the original device in a variable on the storage and then use it to recover on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.
Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on `__reduce_ex__` to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like `_clear_non_serializable_cached_data()` or other logic. Would like thoughts on which logic should be copied, or if everything should be.
3. If any object implements `__deepcopy__`, we will not be able to handle any tensors in its attrs with this logic, because it will likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to work around this.
The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)
# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```
Test Plan:
unit tests
Differential Revision: D75993324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
Context on torch.cuda.memory._record_memory_history buffer behavior
## Description
Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
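For reference, a minimal usage sketch of the API in question (the snapshot path is arbitrary):
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=2000)
x = torch.randn(1024, 1024, device="cuda")
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```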
Fixes #129674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than one dimension has a dynamic size
2. no_x_dim persistent reduce op
Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```
Rollback Plan:
Differential Revision: D76699002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
Summary:
### Diff Context
1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail, so we take a best guess at how to replicate the tensor to maintain symmetry in the returned state dict.
2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs; it is handled in this diff.
Test Plan: CI
Differential Revision: D75553096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4-bit types, and it allows us to represent any shape with an even last dimension, which is much more reasonable in my opinion.
Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
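The new rule, written out as a tiny helper (illustrative, not the PR's code):
```python
def unpacked_float4_shape(packed_shape):
    *lead, last = packed_shape
    return (*lead, last * 2)

assert unpacked_float4_shape((3, 4)) == (3, 8)  # new: double the last dim
# previous behavior, for contrast: (3, 4) -> (3, 4, 2)
```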
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
# Context
MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.
# This diff
Suggested by Alban (PyTorch core): instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.
The benefits of this approach:
1. The previous solution had problems handling ops that have a default dispatch key (e.g. CompositeImplicitAutograd), and couldn't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.
----------------
p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time
Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing symbols from user-provided inputs, then fall back.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid
Rollback Plan:
Differential Revision: D76769711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```
so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.
Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.
Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```
Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13
USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455
USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```
Rollback Plan:
Differential Revision: D76520301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
Fixes#156309
Instead of any kind of locking and busy waits, we leave room for multiple script downloads to happen; only one `rename` will succeed while the others silently fail, removing any temporary files created during the process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet
Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.
Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.
This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).
Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
## Summary
Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.
Fixes#115992
## Details
`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.
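A quick check of the behavior this enables:
```python
import torch
from torch.func import vmap

A = torch.randn(8, 4, 4)
out = vmap(torch.matrix_exp)(A)  # no fallback warning, runs batched natively
assert torch.allclose(out, torch.matrix_exp(A), atol=1e-6)
```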
## Testing
The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76734378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as:
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
**Results:**
For instance, for the `res2net101_26w_4s` model in the PyTorch benchmark, when running with the `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory numbers are:
* baseline: 7.73GiB
* with the change: 6.45GiB
As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.
cc and credit to @ShatianWang for noticing this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.
Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.
What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel).
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.
Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2} and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.
Test Plan:
CI + dry run.
Rollback Plan:
Differential Revision: D76552340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
See the added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings were strange before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
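A hedged usage sketch, assuming the API lands as `torch._dynamo.set_fullgraph` per the issue:
```python
import torch
import torch._dynamo

@torch.compile(fullgraph=True)
def f(x):
    x = x + 1
    with torch._dynamo.set_fullgraph(False):
        torch._dynamo.graph_break()  # tolerated: fullgraph locally disabled
    return x + 2

print(f(torch.ones(3)))
```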
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
This is enough to get @XilunWu's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198
There are two problems:
(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.
This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.
Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.
(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.
**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.
This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4
Co-authored-by: drisspg <drisspguessous@gmail.com>
Summary:
# Why
speed up cutlass kernel generation and retrieval
# What
Using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. Only register the handler internally.
This is the OSS-only part of the change, to facilitate integration.
Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## prove that we can download errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## showing timing information
```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```
Reviewed By: henrylhtsang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
There are a few considerations here:
1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404
2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.
3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.
4. There are two foot guns:
a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).
b. The following sequence won't work as intended:
```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```
This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.
I think these two foot guns are probably okay given that this is a bit of an experts' API.
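For contrast, the working version of that sequence, mirroring the pseudo-Python above (`foo` and `modify` remain placeholders, and per point 3 the Python bindings do not exist yet):
```python
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.instantiate()  # must re-instantiate after modifying the cudaGraph_t
g.replay()
```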
Fixes#155106
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
Followup on https://github.com/pytorch/pytorch/pull/125971
`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overridden, but it does mean we do less work!
Example demonstration:
```python
class Base:
def register_buffer(self, buffer):
pass
class InheritedOk(Base):
pass
class InheritedOverride(Base):
def register_buffer(self, buffer):
pass
b = Base()
ok = InheritedOk()
override = InheritedOverride()
print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False
print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```
(I can make an associated issue if needed, but didn't see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
Fixes#134840
Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:
- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive
These changes help prevent runtime errors when users attempt to use large padding values.
## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
I realize I was passing stable::Tensors by value (thus making a copy every time), which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.
I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).
Update: The LLM failed to mention primitives, with which we should not pass references around, so we are only changing the signatures of std::optional<T> and stable::Tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.
We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().
If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events
In short, they will be used to express fork and join operations in
the graph if external=False.
External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
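A minimal sketch of the new flag in use (the `external` kwarg follows the description above):
```python
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event(external=True)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=s):
    # Recorded with cudaEventRecordExternal rather than acting as a
    # fork/join point in the captured graph.
    e.record(s)
```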
Finishes #146145
I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
Summary: We ran into an interesting case where we saw many mismatches, while a lot of the mismatches turned out to be full matches. The reason is that we used the dump ranks (which go from 0 to 79) to compare against the local PG ranks (0 to 7); this leads to false positives for mismatches. We can just check whether the dump ranks contain all expected ranks; that should be sufficient.
Test Plan:
Tested with the failed case with the script, and we now see the correct behavior, plus a new unit test case.
Rollback Plan:
Differential Revision: D76775373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)
What do we expect to see when we run two identical matmuls back to back? We expect the second one to spend no time in precompilation, autotuning, and prescreening.
However, the introduction of prescreening brings some non-determinism. Basically, we have:
1. prescreening of the first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.
With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.
`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2
Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1, rocSolver <3.26)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS).
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64
Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
scale_A = torch.ones(M, 1).cuda()
scale_B = torch.ones(1, N).cuda()
out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
torch.testing.assert_close(out, out_ref)
latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
tflops = (2 * M * N * K) / latency_us / 1e9
print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS
According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS
Small M slightly improves, but large M is still bad.
If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.
To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than current configs, but still very far away from hardware limit.
@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?
Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than default configs, but still far away from the config above for compute-bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.
The name of the last-run test is saved in order to do more efficient reruns (target the last-run test for a rerun without rerunning the entire file). This is usually correct: e.g., a test fails and pytest catches it -> lastrun = the test that failed; a test segfaults (pytest doesn't catch it) -> lastrun = the test that segfaulted. But sometimes pytest reports a success while the process has a non-zero exit code. The two cases I know of are hangs and double-freeing at exit. In this case, it's unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so that the next session will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's no difference between resetting and not in the normal case), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused the failure. Then on the next run, it starts from the last test, and the process now exits cleanly.
Short-term solution here: ensure that lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path, because the initial value shouldn't change in that case.
Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this
I think I have a PR that fixes the above but its old so I need to take another look
Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp`, but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`.
Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
// C APIs with AtenTensorHandle. Under the hood we still call to these C shim
// APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
// around ownership. This makes it difficult to pass one input into 2
// different functions, e.g., doing something like c = a(t) + b(t) for
// stable::Tensor t. Thus, we use a shared_ptr here.
```
This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.
I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.
Fixes#130722
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.
To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.
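A hedged sketch of the check-and-wrap logic (names are illustrative; the real change lives in Inductor's Triton glue code):
```python
import inspect
import triton.language as tl

def wrap_for_semantic(helper):
    # New Triton (PR 7054): builtins take `_semantic`; accept it here and
    # forward it to an old-style helper under the `_builder` name.
    if "_semantic" in inspect.signature(tl.core.view).parameters:
        def wrapper(*args, _semantic=None, **kwargs):
            return helper(*args, _builder=_semantic, **kwargs)
        return wrapper
    return helper
```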
(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`
Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis.
Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```
After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
Currently, we do it in a bit of a hacky way: look at all the .o files we have from this session and add them all to AOTI. This, for example, doesn't work if we do multiple AOTI compilations in one session without clearing the inductor cache.
Also I want to change how cutlass .so are compiled. Hence this change.
This change is broken down since @coconutruben is trying to make a change to the same files too.
Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
Rollback Plan:
Differential Revision: D76525939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.
A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.
Test Plan:
test/dynamo/test_pgo.py
Rollback Plan:
Differential Revision: D76630499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:
1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile
One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.
I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.
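A hedged sketch of those three steps (illustrative only; the real logic lives inside dynamo's guard evaluation, and the names here are made up):
```python
import torch
import torch.distributed as dist

def lookup_with_guard_collective(lookup_compiled_code, group=None):
    # Step 1: evaluate guards locally via the compiled code lookup.
    code = lookup_compiled_code()
    # Step 2: every rank reports whether it found a cache hit.
    found = torch.tensor([code is not None], dtype=torch.int64)
    dist.all_reduce(found, op=dist.ReduceOp.MIN, group=group)
    # Step 3: if any rank missed, all ranks recompile together.
    return code if found.item() == 1 else None
```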
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
Fixes#128796
This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:
1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
- Masking before division
- Using MaskedTensor (experimental API)
The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.
This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.
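For illustration, a small self-contained example of the pitfall and the recommended masking-before-division fix, in the spirit of the added notes:
```python
import torch

# Masking AFTER division: 1/x is inf at x=0, and the chain rule multiplies
# an infinite gradient by the mask's zero, producing nan.
x = torch.tensor([0.0, 1.0], requires_grad=True)
y = torch.where(x == 0, torch.zeros_like(x), 1.0 / x)
y.sum().backward()
print(x.grad)  # tensor([nan, -1.])

# Masking BEFORE division keeps every intermediate finite.
x2 = torch.tensor([0.0, 1.0], requires_grad=True)
safe = torch.where(x2 == 0, torch.ones_like(x2), x2)
y2 = torch.where(x2 == 0, torch.zeros_like(x2), 1.0 / safe)
y2.sum().backward()
print(x2.grad)  # tensor([0., -1.])
```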
Additional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer
Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.
Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1); getitem_1 = None
copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1); copy_ = None
```
- AOTAutograd: functionalizes the copy_
After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.
While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check; I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that check for Error nodes in the compiled case.
In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
Delays code generation for arguments to fallback ops. This is inspired by #155642, and likely fixes similar memory leaks.
Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In jit tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way; it prints the file which the user should have run (a sketch follows this list).
- Raise a RuntimeError on tests which have been disabled (not run)
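A hedged sketch of what the helper could look like (the actual implementation lives in the shared test utilities):
```python
def raise_on_run_directly(file_to_run: str) -> None:
    # Sketch only: abort with a message naming the file the user
    # should have run instead of this one.
    raise RuntimeError(
        "This test file is not meant to be run directly. "
        f"Run it via: python {file_to_run}"
    )

if __name__ == "__main__":
    raise_on_run_directly("test/test_jit.py")
```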
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there are two triton kernels - one for the old API and one for the new API - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
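For reference, host-side construction with the stable API looks roughly like this (import path per recent Triton releases; treat it as an assumption):
```python
import torch
from triton.tools.tensor_descriptor import TensorDescriptor  # path assumed

x = torch.randn(1024, 1024, device="cuda")
# Host-side TMA descriptor over x with a 64x64 block; this is the call
# that ir.py now recognizes and wrapper.py codegens.
desc = TensorDescriptor.from_tensor(x, block_shape=[64, 64])
# `desc` is then passed as an argument to a @triton.jit kernel.
```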
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.
Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (i.e., the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata).
Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
Fixes#155048
The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:
New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
when there are multiple input elements with the same minimum or maximum value.`
cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
Fixes#155688
Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) whose inner value, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a Python `int`. This is where the `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.
Fix:
Define an `__int__` method for `torch.utils._sympy.functions.Identity`, and likewise a `__float__` method for `float`.
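A minimal sketch of the shape of the fix (illustrative; the real class lives in torch/utils/_sympy/functions.py):
```python
import sympy

class Identity(sympy.Function):
    # Sketch of the fix: delegate numeric conversion to the wrapped
    # expression so TypedExpr.__post_init__ can cast Identity(x) like
    # a plain number.
    def __int__(self):
        return int(self.args[0])

    def __float__(self):
        return float(self.args[0])
```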
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 mins, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold the lock forever. I also audited the locks within FR and found one more place where a critical section can be shrunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of incorrect usages, which I will fix in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (whose current shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
Summary: #buildall
Test Plan: CI
Differential Revision: D74582970
When we decompose to inference IR, aten.to can sometimes disappear. As a result, the export module call graph tree starts containing dead nodes because the previous provenance tracking was insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to submodules whose signatures the user wants to preserve, because we always desugar the tensor subclass into its constituent tensors in inference IR, making it impossible to preserve the original calling convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
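In Python pseudocode, the fixed computation looks roughly like this (a sketch; the actual change is in the C++ kernel):
```python
import torch

def fake_quant_sketch(input, inv_scale, zero_point, quant_min, quant_max):
    q = input * inv_scale + zero_point
    # Clamp while still in floating point, so an inf input can no longer
    # hit undefined float -> int64 conversion behavior.
    q = torch.clamp(torch.round(q), quant_min, quant_max)
    return q.to(torch.int64)
```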
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.
**Fixes #155718**
## Usage Examples
```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"
# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"
# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"
# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
Changes:
- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).
This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.
Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]
Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
Prior to this PR, the exporter did not support dynamic dims with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.
However, a known pitfall still exists because of the difference between eager and export semantics. The compiler needs to decide the exported shape ahead of time, and whether "hidden broadcasting" is applied results in a different export.
For example,
```python
import torch
class Model(torch.nn.Module):
def forward(self, x, y, z):
return torch.cat((x, y), axis=1) + z
model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)
DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}
with torch.fx.experimental._config.patch(backed_size_oblivious=True):
ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))
print(ep)
"""
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
#
sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1); x = y = None
#
eq: "Sym(True)" = sym_size_int_2 == sym_size_int; sym_size_int_2 = None
_assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'"); eq = _assert_scalar_default = None
add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3; sym_size_int_1 = sym_size_int_3 = None
eq_1: "Sym(True)" = add_1 == sym_size_int_5; add_1 = sym_size_int_5 = None
_assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'"); eq_1 = _assert_scalar_default_1 = None
eq_2: "Sym(True)" = sym_size_int == sym_size_int_4; sym_size_int = sym_size_int_4 = None
_assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'"); eq_2 = _assert_scalar_default_2 = None
# File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z); cat = z = None
return (add,)
Graph signature:
# inputs
x: USER_INPUT
y: USER_INPUT
z: USER_INPUT
# outputs
add: USER_OUTPUT
Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
ep.module()(x, y, z)
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
raise e
File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
return inner()
^^^^^^^
File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
_check_input_constraints_for_graph(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
_check_symint(
File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```
The explanation (from @pianpwk):
In the model we have `return torch.cat((x, y), axis=1) + z`.
Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while z has shape, say, `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler is making a decision: does broadcasting apply or not? The choices are:
1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.
Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).
For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Fixes#155028
This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
`use_max_autotune()` is likely not what people expect it to be;
Originally, `use_max_autotune()` was set up to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back, `use_max_autotune()` is also True when `search_autotune_cache=True`, even though `search_autotune_cache=True` should never trigger autotuning.
Over time, people have used `use_max_autotune()` without realizing that it gives unexpected behavior when `search_autotune_cache=True`. We could rename the method to be clearer, but we prefer to phase it out entirely for maximal clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.
Test Plan:
CI
Rollback Plan:
Differential Revision: D76558294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
Fixes#148950.
During graph construction, when running the add node under the [interpreter](https://github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301), the functional argument holding the conjugated complex tensor gets cloned. This results in *.is_conj()* always evaluating to false in the decomposition function.
The proposed fix calls resolve_conj() in the decomposition of complex tensor add.
Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
Fixes#155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
When an op is unimplemented for a specific dtype, raise `NotImplementedError`, which makes more sense than a `RuntimeError`.
Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```
release notes bc-breaking: After this release, a `NotImplementedError` exception will be raised when an ATen operation is called on a combination of input tensor dtypes it has not been implemented for.
Mark a few more unary ops as unimplemented to keep foreach-test error reporting consistent between CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
I guess this is more of an RFC
Goal:
Enable keep-going so that we get information immediately on failures. We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly.
Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before running the next test, it will upload a fake log with enough information that viewing it tells you what failed, with any stack traces/error logs, and that the log classifier can parse to pick a line.
I am getting the log by concatenating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with the `ci-verbose-test-logs` label). There are obviously many things this won't catch, e.g., output outside of run_test.py and some output inside of run_test.py, but it should be enough.
After a job finishes, its raw log is eventually uploaded to the ossci-raw-job-status s3 bucket, and the log classifier reads it to do classification. This means we will have to change the log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one
To overwrite the conclusion on HUD, I'm thinking of a lambda with an S3 put trigger on the fake log being put into S3, which does something similar to the log classifier: it mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run.
Then we change HUD/ClickHouse to point the raw log url to the alternate place, use the new "will_fail" field as the conclusion, and use the temp log classifier result if needed.
Why always write to a temp attribute/column? I am unsure about overwriting the real results with fake ones.
Pros:
Not many changes outside of HUD/UI
Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
Addressing #154890
Not really a proper fix but at least it's more informative than the current crash.
For a longer-term solution, I'm testing whether we can use the TopK API released in macOS 14, as it does not have the same MPSScan op issue that Sort and ArgSort are hitting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781
Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
Summary: Change HF classes to not have a leading underscore, thereby making them public; we will add documentation for them following this.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D76364024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
Extracting both the monitor thread and the watchdog thread into separate classes helps us see which dependencies each thread has, and simplifies the work of consolidating the threads (from one thread per PG instance to one per PG class).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/155023
Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**
Checklist:
- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
Introduce `c10::metal::remainder` and call it from both the inductor and eager implementations, with integer specialization, which should make it much faster than before while still compliant with the Python way of rounding negative numbers.
This allows one to remove the complex type detection logic from the mps codegen and rely on the Metal (C++) type system to figure out input and output types.
This fixes compilation of something like
```python
@torch.compile
def f(x, y):
return x[y % 5]
```
which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
#include <c10/metal/utils.h>
kernel void generated_kernel(
device float* out_ptr0,
constant long* in_ptr0,
constant float* in_ptr1,
uint xindex [[thread_position_in_grid]]
) {
int x0 = xindex;
auto tmp0 = in_ptr0[x0];
auto tmp1 = 12;
auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
auto tmp3 = 1024;
auto tmp4 = static_cast<long>(tmp3);
auto tmp5 = tmp2 + tmp4;
auto tmp6 = tmp2 < 0;
auto tmp7 = tmp6 ? tmp5 : tmp2;
if ((tmp7 < 0) && (tmp7 > 1024)) return;
auto tmp9 = in_ptr1[tmp7];
out_ptr0[x0] = static_cast<float>(tmp9);
}
with program_source:372:28: error: array subscript is not an integer
auto tmp9 = in_ptr1[tmp7];
^~~~~
```
This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes.
This causes an overflow error with a huge number, 9223372036854776048.
Test Plan: CIs should pass.
Differential Revision: D74354663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of the slowest targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734
This target had to run locally because it uses the manifold backend for dotslash. This diff moves `glslc` to the CAS backend so that it can run on RE.
Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar
% ls
-rw-r--r-- 1 navidq staff 2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r-- 1 navidq staff 4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r-- 1 navidq staff 4.4M Jun 12 10:03 glslc-windows-v2024_3.tar
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649
% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```
Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling, dump it all into bytes, and then stitch it back together into a cache of callables.
## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.
## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:
(figure omitted)
Whereas we'd visualize PrecompileContext's result like so:
(figure omitted)
For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.
After this PR, precompile consists of three main interfaces:
### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` loads precompile artifacts into the function's dynamo state with a dictionary of backends
- `cache_entry()` returns a serializable cache entry to save
### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path)`: Save a CompilePackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)
### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key
<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
Adding a per-torch.compile() CompilePackage object which tracks dynamo artifacts. CompilePackage is considered a low-level component and should not be directly exposed to end users. It has the following interface:
1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
a. when the `dynamo` argument is None, it will construct a brand new CompilePackage object.
b. when the `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.
This diff focuses on the low-level mechanism for precompile. It is left to higher-level interfaces to use these APIs to build a more user-facing frontend.
Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu
Co-authored-by: James Wu <jjwu@meta.com>
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.
This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.
Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.
Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
Fixes#136849
## Test Result
```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
streamdata = torch._C._cuda_getCurrentStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.default_stream(device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
streamdata = torch._C._cuda_getDefaultStream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)
>>> torch.cuda.set_per_process_memory_fraction(0.5, device) # INTERNAL ASSERT FAILED
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.
Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
Motivation:
PyTorch maintains a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269, which indicates the default distributed backend when no backend name is specified in the user frontend (like `init_process_group`).
Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group backend is registered as the XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.
Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
9df2e8020f/18e8d4b13b0 (43980565863-box)
started OOMing after switching to CUDA 12.8.
Maybe because I made some changes to fix the per-process memory fraction, so each process has less memory.
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
Fixes#132140
As described in the issue, I've added an example that showcases the use of higher-dimensional inputs and outputs, batched inputs, and the vectorize=True with `torch.autograd.functional.jacobian`.
Could you please review?
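For reference, an example in that spirit (not necessarily the exact one added to the docs):
```python
import torch
from torch.autograd.functional import jacobian

def f(x):  # batched input (B, 3) -> higher-dimensional output (B, 2)
    return torch.stack([x.sum(dim=1), (x ** 2).sum(dim=1)], dim=1)

x = torch.randn(4, 3)
J = jacobian(f, x, vectorize=True)  # vmap-based, avoids a Python loop
print(J.shape)  # torch.Size([4, 2, 4, 3]) -- output dims + input dims
```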
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155683
Approved by: https://github.com/soulitzer
This change will add the ability to support re-sharding for hf safetensors checkpoints.
This is done by adding more metadata when saving each file. This metadata captures the size and offset of the saved shard. This can be used to re-shard on load by using this information to create the chunks belonging to TensorStorageMetadata class.
Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
Summary: This diff modifies the elastic agent's API to pass the event log handler to the record function calls. This change enables the elastic agent to log events to a specific destination, improving the monitoring and debugging capabilities of the distributed training process.
Test Plan:
unit tests
ran an e2e training job.
Differential Revision: D75194115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155457
Approved by: https://github.com/d4l3k
- Fix typo in class name `OpenRgistration`->`OpenRegistration`
- Use existing `common` alias of `torch.testing._internal.common_utils`, i.e. `s/torch.testing._internal.common_utils.markDynamoStrictTest/common.markDynamoStrictTest/`
- Remove `TEST_CUDA` and `TEST_ROCM`, which are unused in that file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155760
Approved by: https://github.com/albanD
Summary:
We generate AtenTensorHandles for fallback kernels regardless of the arg type. If we indeed fall back, we regenerate the AtenTensorHandles, which causes the first handle generated to never be recycled, so a memory leak occurs.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406. Only limit the scope to files in c10.
Summary from the original PR,
```
Looking in the code I see
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].
Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```
Test Plan: CI
Differential Revision: D72762767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
Instead of `setup-miniconda`
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the path environment variable)
- Skip `TestMultiprocessing.test_fs_sharing` as even though it completes, it hangs on the shutdown both in CI and in all local setups I have
- Skip `TestCppExtensionOpenRgistration.test_base_device_registration` as it hangs on the shutdown as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515, #155697
Summary:
Moves DelegateExecutor base class to PyTorch core. It provides the extension point of backend delegation for NativeRT.
Torch Native Runtime RFC: pytorch/rfcs#72
Test Plan:
This is only a virtual base class. So relying on internal CI is sufficient.
Rollback Plan:
Differential Revision: D76351984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155581
Approved by: https://github.com/zhxchen17
vLLM's profiler sets with_stack=True, which surfaces dict_getitem frames in the profile, both inflating the numbers and confusing compile users. This PR keeps BINARY_SUBSCR for regular dicts, while using `dict.__getitem__` only for dict subclasses.
Using binary_subscr is a little bit faster, but not enough to make any major latency improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155727
Approved by: https://github.com/zou3519, https://github.com/StrongerXi, https://github.com/jansel
Summary:
Currently, node.meta["stack_trace"] is not preserved when we torch.package/load a GraphModule, which means the original stack trace is lost. When we re-trace the packaged graph module, we just get a stack trace like fx-generated._0...
This adds node.meta["stack_trace"] to the torch-packaged graph module.
Test Plan:
```
buck2 run @//mode/dev-nosan fbcode//caffe2/test:package -- -r TestPackageFX
```
Rollback Plan:
Differential Revision: D76379692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155638
Approved by: https://github.com/angelayi
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if the input tensor has shape [M, K]. However, we find that the AMX kernel brings a performance benefit when 4 < M < 16. For example, on a 6th-gen Intel(R) Xeon(R) platform, E2E latency can be improved by more than 20% when running Llama-3.1-8B on 32 cores for M = 8. So, this PR changes the threshold so that AMX is used when M > 4.
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
as titled. It's sometimes confusing to use PlacementStrategy as a name,
as we also have OpStrategy and TupleStrategy, the latter two contain
the former, so it is better to make the naming clearer.
Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.
Also found some utils that can be merged into OpSchema, so they are included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.
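A usage sketch under the new behavior, assuming the argument now defaults to None as described (single-process gloo group for illustration):
```python
import torch.distributed as dist

dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
# With this change, omitting `group` (or passing None) returns the ranks
# of the default process group instead of requiring an explicit group.
print(dist.get_process_group_ranks())      # [0]
print(dist.get_process_group_ranks(None))  # [0]
dist.destroy_process_group()
```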
Test Plan:
contbuild & OSS CI
Rollback Plan:
Differential Revision: D75817800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header, because we were loading the contents of the file directly into tensors with safetensors.load(), which handled the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save, so that will be added in a subsequent PR.
In addition this PR adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.
Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
Summary:
Previously, we only added stack traces in class _ModuleStackTracer(PythonKeyTracer) for non-strict export. I moved this stack trace logic to the parent class PythonKeyTracer, so the graph traced from a Module using make_fx will have stack_trace as well.
Motivation: we've observed some use cases where users first use make_fx on the Module, and then run export on the resulting graph. If the result of make_fx doesn't have stack traces, the stack trace information is lost.
**User needs to turn this on by passing in `stack_trace=True` to make_fx. We don't make this the default option since this might increase inductor compilation time (`make_fx` is used in inductor to trace graph patterns for pattern matching). It's also turned on if `_inductor.config.trace.enabled` is True.**
**preserving stack trace is on by default for ModuleStackTracer, which is used for non-strict export.**
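A hedged usage sketch (kwarg name taken from the description above; check the final API):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x.sin() + x.cos()

# Opt in to stack-trace preservation; off by default to avoid slowing
# inductor's pattern tracing.
gm = make_fx(f, stack_trace=True)(torch.randn(3))
for node in gm.graph.nodes:
    print(node.op, node.target, node.meta.get("stack_trace"))
```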
Test Plan:
```
buck run test:test_export -- -r test_stack_trace
buck run fbcode//caffe2/test/dynamo:test_dynamo -- -k test_autocast_ordering
```
Rollback Plan:
Differential Revision: D76298692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155486
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary:
The new coremltools exports an mlpackage instead of an mlmodel by default. When we use the new coremltools 8.0 to convert to the backend, the error is
```
Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Please provide that as well, using the 'weights_dir' argument.
```
Test Plan:
tested with internal workflow
Rollback Plan:
Differential Revision: D76325462
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155543
Approved by: https://github.com/shoumikhin
This is a new PR duplicating #154675 due to merge issues with that PR coming from my old (now updated) version of ghstack.
I am a Vulkan noob, but this extension and flag seem to be necessary. See "Encountered VK_ERROR_INCOMPATIBLE_DRIVER" at https://vulkan-tutorial.com/Drawing_a_triangle/Setup/Instance .
(For anyone trying to repro at home, I have the following homebrew packages installed, not all of which may be necessary: molten-vk, vulkan-headers, vulkan-loader, vulkan-tools, vulkan-utility-libraries. I also have VK_ICD_FILENAMES set to /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json, and I built PyTorch with USE_VULKAN=1. Making sure vkcube works helped me debug this setup.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155595
Approved by: https://github.com/malfet
~~The PR: https://github.com/pytorch/pytorch/pull/152478 did not respect the release policy that the deprecation should happen after the deprecation message has been set for 2 releases. This PR postpones 2.8 to the rightful version 2.10.~~
~~NOTE: "as early as" 2.10 shall give ONNX users more time to adapt and provide feedback.~~
To follow the upcoming torchscript deprecation, `torch.onnx.export` expects to switch to dynamo=True (also turning on fallback=True for BC) in torch 2.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155580
Approved by: https://github.com/justinchuby, https://github.com/tugsbayasgalan
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
Previously, when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead to a data-dependent error on individual checks: e.g. we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10`.
This handles that, and also makes `sym_and/or` user-code friendly, for testing.
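A hedged sketch of the behavior this enables, assuming `sym_and` is importable from `torch.fx.experimental.symbolic_shapes` as this PR suggests:
```python
import torch
from torch.fx.experimental.symbolic_shapes import sym_and

torch._dynamo.config.capture_scalar_outputs = True  # allow .item() in the graph

def f(x):
    u0 = x.item()
    torch._check(sym_and(u0 >= 0, u0 <= 10))
    # Each conjunct is now stored individually, so this no longer raises a
    # data-dependent error:
    torch._check(u0 <= 10)
    return x.new_zeros(u0)

torch.compile(f, fullgraph=True)(torch.tensor([7]))
```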
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
Improves the GEMM overview logging in PyTorch Inductor to properly display batch size information for batched matrix operations like `torch.bmm` and `torch.baddbmm`.
**Fixes #155307**
## Problem
The current GEMM logging for `torch.bmm` omits the batch size. Repro:
```python
# Repro
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch
M, N, K = 1024, 1024, 1024
dtype = torch.bfloat16
A = torch.randn(10, M, K, device="cuda", dtype=dtype)
B = torch.randn(10, K, N, device="cuda", dtype=dtype)
compiled_model = torch.compile(torch.bmm, fullgraph=True)
_ = compiled_model(A, B)
```
**Before:**
```
Name | M | N | K | Count
----------------------------------------------------------------------------------------------------
aten.bmm | 1024 | 1024 | 1024 | 1
----------------------------------------------------------------------------------------------------
```
The batch size (10) is missing from the logs, making it unclear what the actual operation dimensions were.
## Solution
**After:**
```
Name | B | M | N | K | Count
----------------------------------------------------------------------------------------------------------------------------------
aten.bmm | 10 | 1024 | 1024 | 1024 | 1
aten.mm | - | 1024 | 1024 | 1024 | 2
----------------------------------------------------------------------------------------------------------------------------------
```
## Changes Made
### 1. Enhanced Parsing Logic in compile_fx.py
- Detects batched operations by checking if operation name ends with `'bmm'` or `'baddbmm'`
- For batched operations: takes last 4 parts as `batch, m, n, k`
- For non-batched operations: takes last 3 parts as `m, n, k`
- **Dedicated "B" column**: Added separate column for batch size instead of embedding in operation name
- Shows batch size for batched operations, shows "-" for non-batched operations (see the parsing sketch after this list)
### 2. Updated All MM Operations for Consistency
- **bmm.py**:
- Extract batch size from `mat1.get_size()[0]` for both `tuned_bmm` and `tuned_baddbmm`
- Use positional counter keys: `aten.bmm_{batch_size}_{m}_{n}_{k}`
- Enhanced log messages to include batch size information
- **mm.py**: Updated counter keys for consistency:
- `aten.mm_{m}_{n}_{k}` (no batch dimension)
- `aten.addmm_{m}_{n}_{k}` (no batch dimension)
- `aten._int_mm_{m}_{n}_{k}` (no batch dimension)
- `aten._scaled_mm.default_{m}_{n}_{k}` (no batch dimension)
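To make the key format concrete, here is a hedged reconstruction of the parsing described above (illustrative only, not the exact code in compile_fx.py):
```python
def parse_gemm_counter_key(key: str):
    # e.g. "aten.bmm_10_1024_1024_1024" or "aten.mm_1024_1024_1024"
    parts = key.split("_")
    if parts[0].endswith(("bmm", "baddbmm")):
        name, batch = "_".join(parts[:-4]), parts[-4]  # batched: last 4 parts are B, M, N, K
    else:
        name, batch = "_".join(parts[:-3]), "-"        # non-batched: "-" in the B column
    m, n, k = parts[-3:]
    return name, batch, m, n, k

print(parse_gemm_counter_key("aten.bmm_10_1024_1024_1024"))   # ('aten.bmm', '10', '1024', '1024', '1024')
print(parse_gemm_counter_key("aten._int_mm_1024_1024_1024"))  # ('aten._int_mm', '-', '1024', '1024', '1024')
```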
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155544
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` when the non-contiguous case is fine if we encounter a DDE.
- use ref's contiguous over Aten's contiguous because Aten's version will DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.
## Example DDEs
- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)). (Size-like symbols: u0)
Caused by: layer_norm = self.layer_norm(linear) # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```
- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```
- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2). (Size-like symbols: u0)
Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
Example new error message
```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
Framework stack:
File "??", line 0, in _start
File "", line 0, in __libc_start_main_alias_2
File "??", line 0, in __libc_start_call_main
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
foo(x)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
return fn(*args, **kwargs)
File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
return self._torchdynamo_orig_callable(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
result = self._inner_convert(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
return _compile(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
return function(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
tracer.run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
super().run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
while self.step():
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
self.dispatch_table[inst.opcode](self, inst)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
self.push(fn_var.call_function(self, self.popn(nargs), {}))
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
return lambda tx, args, kwargs: obj.call_function(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
return wrap_fx_proxy(tx, proxy)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
return _wrap_fx_proxy(
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
ret_val = wrap_fake_exception(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
return fn()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
return node.target(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
return self._dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
final_shape = infer_size(final_shape, shape)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
torch._check(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
_check_with(RuntimeError, cond, message)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
if expect_true(cond):
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
return a.node.expect_true(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
return self.guard_bool(file, line)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
r = self.evaluate()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
return self.shape_env.evaluate_sym_node(self, size_oblivious)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
return self.evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
return self._inner_evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
return self._evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
self._maybe_guard_rel(g)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
self._refine_ranges(expr)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
self._set_replacement(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
File "??", line 0, in torch::unwind::unwind()
User stack:
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
return torch.randn(5) * x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
Fixes https://github.com/pytorch/torchtitan/issues/1185
It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
I'm not sure how worried we should be on the blast radius of this change. On the one hand:
(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)
(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
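For illustration, a hedged sketch of the default filtering described above (not the exact inductor code; the private asyncTP config is assumed to be `torch._inductor.config._micro_pipeline_tp`):
```python
# Illustrative only: entries whose names start with "_" were left out of the
# cache key by default.
def cache_key_entries(config: dict) -> dict:
    return {k: v for k, v in config.items() if not k.startswith("_")}

entries = cache_key_entries({"_micro_pipeline_tp": True, "max_autotune": False})
assert "_micro_pipeline_tp" not in entries  # silently dropped from the key
```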
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
This PR adds a workflow: whenever a dev makes a PR that touches files under torch/_dynamo, we check for any unimplemented_v2() callsites. If any of them have been modified, the workflow fails, lists exactly which callsites changed, and tells the dev which commands to run to update the registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155610
Approved by: https://github.com/williamwen42
Summary:
As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
- As a result we can probably remove the HFSavePlanner, but we're keeping it as a placeholder for now
- The current naming of files doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names they will overwrite each other, so this adds a "shard-00001" prefix so that the rank files don't overwrite each other (see the sketch after this list)
- Don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1:1 mapping between tensors and filenames, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file
- Make the "fqn_to_file_index" map optional. This describes which files to save which tensors in, but if users don't want to provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with the 5GB HF remote storage file-size soft limit.
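A hedged sketch of the resulting naming scheme (the exact format string in HFStorageWriter may differ; 1-based rank numbering is an assumption):
```python
def sharded_filename(rank: int, index: int, total: int) -> str:
    # "shard-00001" prefix is from the description; the rest mirrors the
    # existing "model-00001-of-00004.safetensors" convention.
    return f"shard-{rank:05d}-model-{index:05d}-of-{total:05d}.safetensors"

print(sharded_filename(1, 1, 4))  # shard-00001-model-00001-of-00004.safetensors
```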
Test Plan: test_hf_storage.py
Reviewed By: saumishr
Differential Revision: D75099862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.
This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
We have reached a point where many files on the cpp side relate to symmetric memory and are scattered around. Let's first put all symmetric-memory-related code into a separate folder; we can do further refactoring later if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
When we export a scripted function, we inline the original callable stored in "_torchdynamo_inline"; this is the same strategy as the torch.compile path.
We do the same thing for script methods, where a `__wrapped__` attribute points to the original callable in most cases. There are some corner cases we identified: methods of top-level jit.scripted modules don't have a `__wrapped__`. In this case, we fall back to the original scripted approach. There may be more such cases, but they need verification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155180
Approved by: https://github.com/zou3519
Sccache wasn't working for nvcc on jammy, so this manually sets the path to include where nvcc is.
I had problems when always making nvcc a wrapper: in some inductor tests I got
```
sccache: encountered fatal error
sccache: error: PCH not supported by nvcc
sccache: caused by: PCH not supported by nvcc
```
and I also got an error (only on clang) when trying to set CMAKE_CUDA_COMPILER_LAUNCHER to /opt/cache/bin/sccache or sccache
```
ccache: error: failed to execute compile
sccache: caused by: Compiler not supported: "nvcc warning : Support for offline compilation for architectures prior to \'<compute/sm/lto>_75\' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\nnvcc fatal : Failed to preprocess host compiler properties.\n"
```
Non-jammy CUDA jobs' docker images used a different Dockerfile, which sets CMAKE_CUDA_COMPILER_LAUNCHER: e895e9689c/.ci/docker/ubuntu-cuda/Dockerfile (L110)
Alt solution:
Given that I only get the error on clang, I could set CMAKE_CUDA_COMPILER_LAUNCHER=sccache only when not using clang
Setting CUDA_NVCC_EXECUTABLE doesn't fail but also doesn't result in cache hits/misses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155464
Approved by: https://github.com/malfet, https://github.com/huydhn
Since its introduction ~4 years ago, the test `test_sort_large` has always been deselected because it requires 200GB of CUDA memory. Now that we do have GPUs this big, it gets selected, but fails because `var_mean` is not a member of `torch.Tensor` and `var_mean` accepts only floating-point tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155546
Approved by: https://github.com/eqy
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.
In quantization tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154728
Approved by: https://github.com/ezyang
This is the first PR of a series in an attempt to get the content of #134592 merged as smaller PRs (Given that the original one was closed due to a lack of reviewers).
This specific PR contains:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Update ao tests.
There will be follow-up PRs to update the other test suites, but I don't have permissions to create branches directly on pytorch/pytorch, so I can't create a stack and will therefore have to create them one at a time.
Cc @jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154612
Approved by: https://github.com/jcaip
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel` and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.
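A minimal usage sketch, assuming an MPS-capable machine so the new Metal kernels are exercised:
```python
import torch
import torch.nn.functional as F

x = torch.randn(8, device="mps", requires_grad=True)
y = F.hardshrink(x, lambd=0.5)  # forward Metal kernel; lambd is the threshold
y.sum().backward()              # backward Metal kernel
```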
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
Developers can now pass an --additional-info arg to the add and update dev terminal commands to include any additional remarks the dev might want to make about the graph break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155526
Approved by: https://github.com/williamwen42
A precompiled bytecode looks like the following:
```
pre-graph bytecode
...
compiled graph code
...
post-graph bytecode
```
In pre-graph bytecode we have calls into helper functions like torch._dynamo.utils.call_size which will invoke @disable inside the bytecode.
Normally torch.compile() will handle these frames fine, but for precompile we will load bytecode from a clean state of dynamo, and we want a way to assert that recompiles never happen, so the current way to ensure this is by doing set_stance("fail_on_recompile") (open to any other idea to test this, but IMO this is the closest thing we have today).
This approach doesn't work when util functions like call_size() are involved, and this PR fixes a bunch of places to make sure "fail_on_recompile" can skip through the functions meant to be skipped during compilation.
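For context, a minimal sketch of the `fail_on_recompile` pattern described above (`torch.compiler.set_stance` is the public API):
```python
import torch

@torch.compile
def fn(x):
    return x + 1

fn(torch.randn(2))  # warm up: compile once
torch.compiler.set_stance("fail_on_recompile")
fn(torch.randn(2))  # OK: served by the existing compiled entry, no recompile
```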
Differential Revision: [D76156867](https://our.internmc.facebook.com/intern/diff/D76156867/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155363
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #155329
While loading deserialized dynamo states back from disk, precompile will need a direct way to access ExtraState and populate guarded bytecode as cache entries.
This diff adds two APIs at the code level to load precompiled guard + bytecode entries:
1. _load_precompile_entry() will append an entry to a precompile entry list per code object. This precompile entry will be looked up before normal compiled entries.
2. _reset_precompile_entries() will clean up all the installed existing entries. This is useful to prevent the case where a user calls loading multiple times and explodes the number of entries in the list.
Differential Revision: [D76083247](https://our.internmc.facebook.com/intern/diff/D76083247/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155329
Approved by: https://github.com/jamesjwu, https://github.com/jansel
To support https://github.com/pytorch/ao/issues/2228:
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops
Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a don't-constant-fold flag (see the sketch below).
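A hedged illustration of the idea (the names below are illustrative, not the exact flag added by this patch):
```python
# Illustrative only: tag ops so the constant folder treats them as impure and
# leaves the weight -> dequantize -> op pattern visible to later fusion passes.
DONT_CONSTANT_FOLD = ("quantized_decomposed.dequantize_per_channel",)

def is_impure(node) -> bool:
    # The constant folder would skip nodes reported as impure.
    return node.op == "call_function" and str(node.target).startswith(DONT_CONSTANT_FOLD)
```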
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
Triton XPU shares its version file with the community one. When the community updates the Triton version, it temporarily breaks the XPU CI/CD because they use different repositories and commits. To decouple Triton version bumps between the community and XPU, we propose splitting the version into two separate files.
Refer to the latest community Triton version bump PR #153117.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155313
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/atalman
Follow-up to #154858.
Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API.
First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr).
Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim.
Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155182
Approved by: https://github.com/drisspg
Summary:
When a custom op has an int return type in its schema, the returned value will be specialized; this behaviour is different from a symint return type. This diff **only adds support for the int return type**.
As the returned int will be specialized and fused into downstream kernels (if being used), we can simply skip the int return type in the proxy executor.
Note that in the eager run, the returned int will be specialized to the value defined in the real impl of the custom op. In exported program or in AOTI, the returned int will be specialized to the value defined in the fake impl of the custom op. So the definitions of the return value should be consistent across real and fake impl of the custom op. Otherwise the eager run and AOTI run will have different results.
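A hedged sketch of that consistency requirement ("mylib::num_chunks" is a hypothetical op defined via `torch.library`):
```python
import torch

# "mylib::num_chunks" is hypothetical; the int it returns must agree between
# the real and fake impls, since the value is specialized at trace time.
torch.library.define("mylib::num_chunks", "(Tensor x) -> int")

@torch.library.impl("mylib::num_chunks", "CompositeExplicitAutograd")
def num_chunks(x):
    return 4

@torch.library.register_fake("mylib::num_chunks")
def _(x):
    return 4  # must match the real impl, or eager and AOTI results diverge
```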
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_int_output
```
Rollback Plan:
Differential Revision: D76159406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155465
Approved by: https://github.com/angelayi
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.
On-device TMA works like this:
```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
tl._experimental_descriptor_store(workspace_ptr, data, ...)
```
The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain descriptor data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.
The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); and then during mutation analysis, any tma creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `data` in the above example).
Consider this example, which I'll use to describe the pros/cons of this approach.
```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)
@triton.jit
def kernel(A, B, workspace_ptr):
create_tma(A, workspace_ptr)
workspace_B = workspace_ptr + 128
create_tma(B, workspace_B)
data = tl._experimental_descriptor_load(workspace_ptr, ...)
tl._experimental_descriptor_store(workspace_B, data, ...)
```
An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).
A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.
Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
Some core modules for versions of python installed in /opt depend on libraries in /usr/local, but those libraries are not copied over from the base container. For example:
```
/opt/python/cp312-cp312/bin/python3 -c "import sqlite3"
ImportError: libsqlite3.so: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155211
Approved by: https://github.com/huydhn
Summary:
Serialization contains utilities to deserialize a graph saved on disk in json format as defined in `torch/csrc/utils/generated_serialization_types.h` to the in-memory representation as defined in `torch/nativert/graph/Graph.h`
Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test
Rollback Plan:
Differential Revision: D76012641
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
... and fix ck-tile instances not being generated due to incorrect caching
### Testing
Added test cases for CKTILE instances
```
pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294
Approved by: https://github.com/coconutruben
Summary:
Previously, we only added the stack trace in `class _ModuleStackTracer(PythonKeyTracer)` for non-strict export. I moved this stack trace logic to the parent class `PythonKeyTracer`; this way the graph traced from a Module using make_fx will have stack_trace as well.
Motivation: we've observed some use cases where users first use `make_fx` on the Module and then run `export` on the resulting graph. If the result of `make_fx` doesn't have a stack trace, the stack trace information is lost.
Test Plan:
```
buck run test:test_export -- -r test_stack_trace
```
Rollback Plan:
Differential Revision: D75985427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155155
Approved by: https://github.com/angelayi, https://github.com/zou3519
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](a3a196ccdb), which includes:
- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL as USE_ONEMKL_XPU and set it as default ON
- Refactor USE_XCCL & USE_C10D_XCCL option
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154962
Approved by: https://github.com/EikanWang
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table that maps the memory address of the tensor storage object to a unique id. If a new memory address is found, it is put into that hash table and associated with a new id.
This implementation does not guarantee the storage object is unique, since the memory that the address points to may be released and then re-allocated to a different tensor storage object.
Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Differential Revision: D75749065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154859
Approved by: https://github.com/eellison, https://github.com/ngimel
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.
On an AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect in certain cases of CPU tensors' backwards, where the backward is being done in Python land and dynamo unnecessarily tries to trace through the inductor-generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already.
```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]
```
This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical.
Fixes https://github.com/pytorch/pytorch/issues/154536
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b
Even though a check for an empty tensor list exists in `at::cat`, a crash might still happen while resolving a named dimension to a position, by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
556 bool has_names() const {
557 // If a user is using unnamed tensors, then we can short-circuit right here.
558 // Otherwise, impl::has_names attempts to retrieve names.
-> 559 if (!impl_->has_named_tensor_meta()) {
560 return false;
561 }
562 return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
20 int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
21 TORCH_CHECK(dim.type() != NameType::WILDCARD,
22 "Please look up dimensions by name, got: name = None.");
-> 23 TORCH_CHECK(tensor.has_names(),
24 "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
25 const auto names = tensor.names();
26
```
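A plausible repro sketch based on that backtrace (hedged; the exact call from the linked issue may differ): an empty tensor list plus a named dimension reaches `dimname_to_position` with no tensor to inspect.
```python
import torch

# crashed with a null-tensor dereference before this fix; afterwards it should
# raise a proper error about the empty tensor list
torch.cat([], dim="a")
```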
TODOs:
- Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
- Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests
Fixes https://github.com/pytorch/pytorch/issues/155306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: #155382
Add comprehensive module docstring explaining built-in function and type
variable tracking, including handling of Python built-ins, type constructors,
operators, and special constructs during symbolic execution.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155402
Approved by: https://github.com/Skylion007
ghstack dependencies: #155403
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
Add comprehensive module docstring explaining side effect tracking and
management, including mutation tracking, context changes, aliasing,
and state preservation during symbolic execution.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155403
Approved by: https://github.com/williamwen42
Add comprehensive module docstring explaining the tracing rules and policies
that govern TorchDynamo's compilation decisions, including skip rules,
inlining policies, and library-specific handling.
Originally generated by claude but reviewed and edited by me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155401
Approved by: https://github.com/williamwen42
Fixes#151829
Summary:
Currently, inductor has a lazy init which causes certain aten ops to run during a profiling run. This ends up cluttering the function events, especially for smaller traces. One attempt to fix this was to simply remove that import from the profiler entirely, but the import happens somewhere downstream anyway and the events still flood our profile.
To fix this, we trigger the inductor import during prepare trace if inductor is present. This way, regardless of how the workload imports inductor, the actual init process is done before tracing starts, resulting in more accurate traces.
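A minimal sketch of the scenario this improves (assuming a CUDA machine; the exact events will vary): the first profiled region should no longer be flooded by inductor's lazy-init aten ops.
```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.ones(4, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    x.mul_(2)  # before the fix, the first run also recorded inductor init ops
print(prof.key_averages().table(sort_by="cpu_time_total"))
```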
Test Plan:
Added a test; also ran N7316820 manually and went from getting many events on the first run to the following output (the only difference is the Runtime Triggered Module Loading entries, which are CUPTI overhead events):
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::mul_ 1.40% 340.638us 99.92% 24.390ms 24.390ms 1.535us 100.00% 4.605us 4.605us 1
cudaLaunchKernel 0.60% 146.533us 98.52% 24.049ms 24.049ms 0.000us 0.00% 3.070us 3.070us 1
Runtime Triggered Module Loading 6.14% 1.500ms 6.14% 1.500ms 1.500ms 1.535us 100.00% 1.535us 1.535us 1
Runtime Triggered Module Loading 91.78% 22.403ms 91.78% 22.403ms 22.403ms 1.535us 100.00% 1.535us 1.535us 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.535us 100.00% 1.535us 1.535us 1
cudaDeviceSynchronize 0.08% 20.031us 0.08% 20.031us 20.031us 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::mul_ 82.81% 484.396us 94.26% 551.378us 551.378us 1.440us 100.00% 1.440us 1.440us 1
cudaLaunchKernel 11.45% 66.982us 11.45% 66.982us 66.982us 0.000us 0.00% 0.000us 0.000us 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.440us 100.00% 1.440us 1.440us 1
cudaDeviceSynchronize 5.74% 33.581us 5.74% 33.581us 33.581us 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D76056511
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155243
Approved by: https://github.com/ngimel
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.
Follow up to: https://github.com/pytorch/pytorch/pull/152715
Before this PR,
On a CPU-only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case no GPU would be detected and an excessive list of cuda architectures might be used. This also means that there was no way to build a binary targeting a different SM than that of the machine a developer is using.
After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.
p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.
See draft, developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md
This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:
* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Add a few missing returns in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
It's better because you return less data, encapsulate more, and no longer need special handling of the symbolic vs non-symbolic vector in `dim()`. Also removes a few casts and fixes a few potential edge cases relating to unsigned comparisons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155334
Approved by: https://github.com/ezyang
The downstream consumer of the 2D all-to-all-v is often a grouped GEMM.
Today the GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._grouped_mm` requires an alignment of 8.
This PR adds that alignment capability: when the user passes in a `major_align` argument, no extra padding step is needed.
The key to supporting that is making the output offsets aligned to this value. (Output offsets are returned to the users in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, is unaffected by this alignment value -- i.e. it reflects the true number of tokens for an expert.)
The algorithm computes the aligned output offsets via a prefix sum over the padded chunk sizes (the illustrating figure is omitted here).
In the detailed implementation, we use a warp scan to calculate that prefix sum. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
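A rough sketch of the idea in plain PyTorch (host-side; not the actual device-side warp-scan code): pad each chunk size up to `major_align`, then take an exclusive prefix sum to get the aligned offsets, while the true (unpadded) splits are reported separately.
```python
import torch

def aligned_offsets(splits: torch.Tensor, major_align: int) -> torch.Tensor:
    # round each chunk size up to the alignment
    padded = ((splits + major_align - 1) // major_align) * major_align
    # exclusive prefix sum: offset of chunk i is the padded size of chunks < i
    return torch.cumsum(padded, dim=0) - padded

print(aligned_offsets(torch.tensor([5, 12, 3]), 8))  # tensor([ 0,  8, 24])
```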
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
Summary: Log backward no-op to tlparse and pt2 compile events.
Test Plan:
$ rm -rf /tmp/r && TORCH_TRACE=/tmp/r buck2 run //scripts/jovian:backward_noop_repro_compile
Used print statements to verify we enter the logging code region.
Differential Revision: D75231665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154544
Approved by: https://github.com/c00w
# Feature
If `config.triton.autotune_at_compile_time` is set to `True`, autotune Triton kernels during FX conversion. Else, stick with the existing behavior of using the first precompiled config.
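A usage sketch of the flag named above (the config attribute is quoted from this PR; whether it takes effect for the FX conversion path is what this PR adds):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.autotune_at_compile_time = True  # opt in at compile time

@torch.compile
def f(x):
    return x * 2

f(torch.randn(8, device="cuda"))
```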
# Test plan
Added CI tests verifying that the tuner is called iff this flag is set, with and without dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155049
Approved by: https://github.com/jansel
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
Source: | Rank 0 | Rank 1 |
| c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |
Dest : | Rank 0 | Rank 1 |
| c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` is a slice of the `input` tensor targeting expert `i`, with length indicated by the input splits (in `in_out_splits[0]`).
That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
The purpose of this change is to reduce the scope of s390x CI so it stops potentially blocking usual workflows for other users, while still keeping nightly builds and tests for me to look at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155208
Approved by: https://github.com/malfet
Metal arguments must be 8-byte aligned (or perhaps 16-byte), so running any strided (or typecasted) binary op with MTL_DEBUG_LAYER leads to an exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```
Extend it to 4 elements and pass the output dtype, which will be used by binary_op later on anyway.
Test plan: Run the abovementioned command with `MTL_DEBUG_LAYER=1` and make sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
Prompt for Sonnet 3.7 in Claude Code: Only inspect tools/nightly.py, all
other files are irrelevant to your task. Do not use any shell commands.
Task: Add a --detach argument to this script which instead of making a
new branch just directly checks out the correct commit in detached mode.
With two interventions:
- Branch and detach are mutually exclusive. So you should consolidate
them into a single argument. Why don't we take over the 'None' option?
- Do you know that nightly_version is guaranteed to be a commit hash? It
seems it would be safer to explicitly pass --detach
I tested by running `python tools/nightly.py checkout` and observing
that my worktree was detached at this point.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154314
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
## Description
Fixes a typo in the comment of `torch/fx/experimental/proxy_tensor.py`, changing "intialize" to "initialize".
## Issue
None
## Type of change
- [x] Typo fix
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155301
Approved by: https://github.com/jingsh, https://github.com/ezyang, https://github.com/cyyever
Fixes#122757
The fix was lost after the revert and rebase of the previous PR https://github.com/pytorch/pytorch/pull/150750 (only the test changes were merged).
## Test Result
```python
>>> import torch
>>>
>>> model_output = torch.randn(10, 5).cuda()
>>> labels = torch.randint(0, 5, (10,)).cuda()
>>> weights = torch.randn(5)
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
>>> loss = loss_fn(input=model_output, target=labels)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
return F.cross_entropy(
^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3476, in cross_entropy
return torch._C._nn.cross_entropy_loss(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155085
Approved by: https://github.com/mikaylagawarecki
Summary:
For simple FSDP, this `visualize_overlap` function throws errors.
This seems to be a mistake: `visualize_overlap` is called twice here, once inside a try-except and once not, so this PR does the same for both places.
Test Plan:
:)
Reviewed By: Microve
Differential Revision: D75985733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
Summary: If a model name is specified in aoti config, the generated files will use that model name as file stem.
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```
Differential Revision: D75102034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
Summary:
Adding "from_node" information that indicates which nodes are unlifted in `.module()` call.
The lifted nodes will have "ExportedProgram.module().unlift()" passname in the last entry of from_node.
Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```
Reviewed By: angelayi
Differential Revision: D75837494
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155053
Approved by: https://github.com/angelayi
Previously, we didn't disable the functionalization key when materializing the backward graph. This caused the `torch.zeros_like` call (for the case where grad is None) to return a functional tensor that's not tracked by the proxy tensor mode.
This PR fixes it by putting the tracing code under the disable-functionalization context manager.
Fixes https://github.com/pytorch/pytorch/issues/153437.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
Fixes#154938
When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8
Regarding the revert & reland:
* This PR causes the python 3.13 version to be bumped from 3.13.2 to 3.13.3. test_deopt_from_append_list starts unexpectedly passing on 3.13.3, so I originally modified the test in https://github.com/pytorch/pytorch/pull/155167 to xfail only for <=3.13.2.
* However, there was a land race with https://github.com/pytorch/pytorch/pull/150796, which introduced another test that passes only for >=3.13.3.
Resolution:
* @guilhermeleobas reverted https://github.com/pytorch/pytorch/pull/150796, so I will reland this (and I've merged the test_deopt_from_append_list change into this PR). Based on Guilherme's feedback, I'm just skipping the test instead of selectively failing/passing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
Fixes#153133
Fixes an inconsistency in torch.arange on the CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in the validation logic.
The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.
This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.
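A minimal sketch of the validation idea in Python (Python floats are already float64; this is not the actual C++ implementation):
```python
def check_arange_args(start: float, end: float, step: float) -> None:
    # promote to double before validating, so float32 rounding cannot make an
    # invalid range (e.g. start > end with a positive step) look valid
    s, e, st = float(start), float(end), float(step)
    if not ((st > 0 and s <= e) or (st < 0 and s >= e)):
        raise ValueError("upper bound and lower bound inconsistent with step sign")
```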
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops
Based on [Jansel's reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), this patch adds a "don't constant fold" flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
Fixes#132031
## Test Result
```python
In [1]: import torch
...: torch.manual_seed(0)
...: torch.cuda.manual_seed(0)
...: a = torch.randn(3, 4)
...: b = torch.randn(3, 4)
...: torch.cross(a, b, out=a)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 6
4 a = torch.randn(3, 4)
5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)
Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases, since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)
In rare cases, this does not work. One example is saved tensors for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression, not a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and later there would be an error in `_verify_input_symbol_assignment`.
The fix is to add symints to `get_graph_inputs`. An alternative would be to update `codegen_input_symbol_assignment`, but I want to limit the change to graph partition only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
Not sure why, apparently this test starts passing on python 3.13.3 (while it fails on python <=3.13.2) and it's causing unexpected passes on xfail-ed tests when newer versions of python are used, e.g. in #155056.
Verified locally in a python 3.13.1 vs. python 3.13.3 conda env.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155167
Approved by: https://github.com/williamwen42
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll end up triggering a "too many values to unpack" issue: https://github.com/pytorch/pytorch/issues/154536
Specifically, if you populate PyCodeCache's linemap via caching and then request the stack frames of an inductor-generated output file that has an empty linemap, this function will try to unpack too many arguments.
Test plan:
```
import os
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"
import torch
@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad
x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```
Run this twice and see that everything works as expected.
It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:
- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time
- The inductor generated output needs to call *back* into dynamo via `assert_size_stride`
- In our test case, the CompiledBackward needs to not have linemaps, and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order in which PyCodeCache.linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.
All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.
I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call in between the `backend._set_enable_nan_check` calls.
Why was it "working" previously? Because previously only cu118 distributed was running, and this `backend._set_enable_nan_check` change was not tested in the merge process (the skip logic is: if not CUDA 12 or above, skip).
Workaround #153479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
Tests:
* test_generators.py
* test_generator_stop.py
* test_contextlib.py
Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.
The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:
```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
Summary:
There are some cases where we want only local annotations for the memory snapshot, such as when executing inside a cudaStream callback, which cannot execute CUDA operators; otherwise CUDA errors occur: `Exception in RecordFunction callback: CUDA error: operation not permitted`.
However, we need an option to turn annotations on globally so that on-demand snapshots can get annotations. Additionally, there may be some cases in which auto-trace will also want annotations using record functions, so we expose the flag to auto-trace as well.
Test Plan:
Run MVAI executable and see that the errors go away
Differential Revision: D75831687
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932
Approved by: https://github.com/mzzchy, https://github.com/sanrise
Summary: The Gloo PG doesn't release the GIL, which results in Python code hanging until the destructor completes. The destructor waits for all work on the PG to complete, which can take a long time.
Test Plan: Ran
```
$ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py
```
with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976
Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj
"Can't be indexed using 32-bit iterator" is not really helpful error
This PR distinguishes between error from old indexing helper function as well as to binaryTensorIterator
Adds the same warning to unary op, otherwise it just runs and returns incorrect value
Test plan (manual, don't have machine with enough RAM to run it reliable in CI):
```
% python -c "import torch;print(torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps'))"
RuntimeError: add can't be indexed using 32-bit iterator for shape [1048576, 5000]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155150
Approved by: https://github.com/Skylion007, https://github.com/dcci
The user can now use the terminal to update the registry whenever they update an existing gb_type's properties. Additionally, if the user changes the gb_type description itself, they can update the registry as well.
Terminal command template for updating an existing gb_type: `python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite]`
Terminal command template for renaming an existing gb_type (can also be used if the user changed the other properties as well, including the gb_type name): `python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite] --new_gb_type "new_name_for_existing_gb_type"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154985
Approved by: https://github.com/williamwen42
Fixes#144522
## Description
The FX QAT docs for `prepare_qat_fx` incorrectly used `get_default_qat_qconfig` where `get_default_qat_qconfig_mapping` should be used: `prepare_qat_fx` requires a `qconfig_mapping`, not a single `qconfig`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155100
Approved by: https://github.com/jerryzh168
Summary:
looks like CPU was enabled for update_constant_buffer in D71177509, so enable these tests as well.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_without_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_user_managed_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_update_weights" -v
```
Differential Revision: D75908993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155078
Approved by: https://github.com/angelayi
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.
In distributed tests:
- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of `unittest.main()`
Cc @wconstab @clee2000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
This PR is part of a series attempting to re-submit #134592 as smaller PRs.
In fx tests:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of `unittest.main()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715
Approved by: https://github.com/Skylion007
Summary: As title. See added test for more context.
Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output_2
Differential Revision: D75900658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155073
Approved by: https://github.com/angelayi
Refactor the pt2 archive saving to consolidate the format of torch.export.save and torch._inductor.package.package_aoti.
This PR adds the following functions, which torch.export.save and AOTI packaging calls into:
```python
package_pt2(
    f: FileLike,
    *,
    exported_programs: Optional[Union[ExportedProgram, dict[str, ExportedProgram]]] = None,
    aoti_files: Optional[Union[list[str], dict[str, list[str]]]] = None,
    extra_files: Optional[dict[str, Any]] = None,
) -> FileLike

@dataclass
class PT2ArchiveContents:
    exported_programs: dict[str, ExportedProgram]
    aoti_runners: dict[str, AOTICompiledModel]
    extra_files: dict[str, Any]

load_pt2(f: FileLike) -> PT2ArchiveContents
```
Power users can directly call into these APIs if they want to bundle multiple exported programs, aoti files, or extra metadata.
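A usage sketch of these APIs (the import path is an assumption; this description only pins down the signatures):
```python
import torch
from torch.export.pt2_archive import package_pt2, load_pt2  # path assumed

class MyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(MyModel(), (torch.randn(2, 4),))
package_pt2("bundle.pt2", exported_programs={"model1": ep})

contents = load_pt2("bundle.pt2")
restored = contents.exported_programs["model1"]
```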
This is what the pt2 archive looks like ([spec](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?tab=t.0)):
```
├── archive_format
├── version
├── .data
├── data
│ ├── aotinductor
│ │ └── model1
│ │ ├── model1.cpp
│ │ ├── model1.so # currently AOTI automatically moves weights in here, TODO to move it out
│ │ ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│ │ └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│ ├── weights
│ │ ├── model1.pt # TODO to dedup weights between model1/model2
│ │ └── model2.pt
│ └── constants
│ │ ├── model1.pt # TODO to dedup weights between model1/model2
│ │ └── model2.pt
│ └── sample_inputs
│ ├── model1.pt # TODO to dedup weights between model1/model2
│ └── model2.pt
├── extra
│ └── user_metadata.txt
└── models
├── model1.json
└── model2.json
```
Future todos:
- unbundle the weights -- instead of .pt, we can use bin files, which will also allow us to dedup weights if we store multiple models
- update aoti_compile_and_package to also save the exported program
- integrate TNR with this packaging flow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152495
Approved by: https://github.com/yushangdi
In OneDNN v3.7, SDPA has the defects below:
1. The dtype of intermediate values is the same as QKV, while PyTorch uses FP32 for intermediate values to ensure better accuracy.
2. Only head dim sizes <= 256 are supported.
3. The implicit causal mask is not supported when QKV is FP32; we need to build an attention mask explicitly with aten ops.
OneDNN v3.8 fixes these defects. Since these are tiny changes, I decided to put them in a single PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, aka the watchdog and heartbeat-monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PGs or subPGs, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Both threads are currently assigned so many responsibilities that it is hard to do the consolidation in one shot, so we split it into at least two steps (PRs) to make it easier to test and review.
We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR does the first step of making the monitoring thread a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.
What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, which is more relevant. This is totally fine since we have fully rolled out EventCache, so watchdog hangs are rare now.
Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. Check TCPStore every 30 sec to see if any watchdog timeout has happened on other ranks; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)
Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
# Feature
This PR fixes two bugs with Inductor's FX backend.
1. When extracting offsets from `ReinterpretView`'s, we accidentally took the offset of the parent layout instead of the view's layout. This case is triggered when multiple kernels write into the same buffer due to `torch.cat`.
2. In certain rare cases, `V.graph.graph_inputs` can contain a constant input value. In case this happens, create a new `sympy.Symbol` for the input, for compatibility with the existing `SymbolBuffer` abstraction mapping to an FX placeholder. This case is triggered when calling `torch._inductor.compile` on certain modules coming from `torch.export`.
# Test plan
Added a couple of tests exposing these bugs.
1. Concat with multiple kernels writing to the same buffer.
2. `Export` -> `torch._inductor.compile` with a constant input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154958
Approved by: https://github.com/jansel
This PR registers the RotaryEmbedding op in the `torch.ops.onnx` namespace and allows the exporter to recognize and export onnx operators.
## Design
ONNX operators of their respective opset versions are implemented in torch/onnx/ops/_impl.py and are registered in the torch.ops.onnx namespace following this rule:
`OpType-version => torch.ops.onnx.OpType.opset{version}`
For example, `RotaryEmbedding-23` becomes `torch.ops.onnx.RotaryEmbedding.opset23`
This name is parsed by the exporter to create an onnx node in the graph without having to go through translation.
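A small sketch of that naming rule in use (availability depends on the installed torch build):
```python
import torch

# RotaryEmbedding-23 maps to torch.ops.onnx.RotaryEmbedding.opset23
versioned_op = torch.ops.onnx.RotaryEmbedding.opset23
```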
When users use the ops in a model, we provide more convenient, unversioned functions under `torch.onnx.ops` that will dispatch to the implementations based on user input (type and provided attributes). For example, users can directly call `torch.onnx.ops.rotary_embedding()` to use the op natively in their PyTorch models. I chose snake_case naming to make the functions more Pythonic and aligned with other torch APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154745
Approved by: https://github.com/titaiwangms
This PR uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling; then we would sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristic scores each tiling by its globally coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) and information we can use in generating block sizes.
In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.
The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.
While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:
1. Add header file
2. Add source file
3. **Add test and build rules**
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:
1. Add header file
2. **Add source file**
3. Add test and build rules.
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75492405](https://our.internmc.facebook.com/intern/diff/D75492405/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154531
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530
We split the large PR for adding Graph.h and Graph.cpp to `nativert` into 3 smaller PRs:
1. **Add header file**
2. Add source file
3. Add test and build rules.
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
Differential Revision: [D75491860](https://our.internmc.facebook.com/intern/diff/D75491860/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154530
Approved by: https://github.com/SherlockNoMad
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I add a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed the `--output-dir` to `--output-csv` for a clearer and more intuitive expression.
An example of the JSON output of the operator benchmark.
```
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  }
]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
Triton 3.4 will remove the experimental TMA apis: https://github.com/triton-lang/triton/pull/6488
To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API.
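A minimal sketch of such a shim (both attribute names are taken from PR descriptions in this log; treat the exact fallback name as an assumption):
```python
import triton.language as tl

# prefer the stable descriptor builder when this Triton version provides it,
# otherwise fall back to the experimental API slated for removal in 3.4
if hasattr(tl, "make_tensor_descriptor"):
    make_descriptor = tl.make_tensor_descriptor
else:
    make_descriptor = tl._experimental_make_tensor_descriptor
```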
Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda`, which previously failed with triton-lang/triton@cda4229558c5dca7f7c4734bedd3e596ebcae0b8 but now passes.
Note: we'll need to apply this for other things in inductor, this just does it for flex attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154858
Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.
To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
By changing the functor to look as follows
```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10::metal::xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10::metal::xlog1py(float(a), float(b));
  }
};
```
```
Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #154936
## Summary
Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.
## Changes
- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed `# mypy: allow-untyped-defs` comment
- Created Literal types for the kaiming initialization mode and nonlinearity.
- Created `__all__`
## Why
Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements.
Tested with existing test suite and lintrunner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154504
Approved by: https://github.com/Skylion007
Find variables that coalesce the reads and writes, and score the total size. If uncoalesced memory expressions are found, look for additional tilings of variables that will coalesce the memory accesses.
For instance, for the expression `(32*p0) // 2048`, tiling p0 by 64 will make this expression coalesced, as checked below.
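A quick arithmetic check of that claim: with `p0 = 64*q + r` and `0 <= r < 64`, the expression equals `q`, so it is constant within each tile of 64.
```python
# verify (32*p0) // 2048 == p0 // 64 for a range of indices
for p0 in range(4096):
    assert (32 * p0) // 2048 == p0 // 64
```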
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153748
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730
In order to take the globally best tiling, we need to normalize all the node read and writes to a common iteration space. This first pr finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153723
Approved by: https://github.com/jansel
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
During the integration of FR with Gloo, I found that putting all the logic inside one cpp file guarded by both build macros does not work with the current linkage setup in the Bazel file. If we put the cpp in libtorch_cpu, the cuda-side build fails; if we put it in both, we get `ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter`. To fix this, we move the common logic into a separate header file and use different cpp files for CPU and CUDA so that FR can be used in both cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154929
Approved by: https://github.com/kwen2501
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.
Test Plan: CI
Differential Revision: D75756619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
resolve https://github.com/pytorch/pytorch/issues/154655
`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model would be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, FLUX; we should reshard the parameters of CLIP and T5 immediately after their forward
* for recommendation models, we may have multiple roots for the dense part
Change the default behavior to always respect `reshard_after_forward=True`.
Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
Triton has added another artifact that gets generated (triton-lang/triton#6992), so `test_cache_load_function` started failing as there are now 8 (instead of 7) artifacts.
Instead of figuring out a way to check exactly which set of artifacts will get generated, I instead modified the test to just check that there are _at least_ 6 artifacts, to account for different platforms (intel/amd/nvidia) and different triton versions (which may or may not have a `.source` artifact)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154879
Approved by: https://github.com/oulgen, https://github.com/masnesral
Summary:
This was implemented in SR because caching of runtime instances built up and caused memory usage spikes after a large amount of traffic went through the model; once traffic went down, SR was still caching all the previous usage.
We need something similar on the Sigmoid side to make sure the static dispatch modules aren't hogging memory. Currently, all ExecutionFrame objects are cached and never freed when stale.
Test Plan:
Added extra execution frames in tmp commit D75257998 and ran local replayer test to confirm extra execution frames get cleaned up down to min size, which is set at 8
{F1978532047}
Also tested by modifying load_net_predictor (modifications also in D75257998) to run benchmarkNumIterations twice - once with benchmarkNumThreads, and once with only one thread. Also set clearing interval at one second. Verified that execution frames get cleared when we drop down to one thread.
{F1978558984}
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```
Differential Revision: D75257992
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154636
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.
Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #154769
Replace the git version so the whl name goes from `torch-something+git<old commit>` to `torch-something+git<new commit>`.
Renamed a bunch of variables to hopefully be clearer.
Tested on ef210ad54b
* Removed gating that prevents it from running on PRs (which is going to be merged soon)
* Removed gating that checks which files can be changed (since this PR has stuff outside of the acceptable list)
* The above two allow the whl to be reused, and I added `assert 1 == 2` in common_utils and checked that jobs failed (meaning they were using the updated code despite not building)
Checked that the whl in the docker image has the right commit sha, didn't check torch.__version__ though
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154773
Approved by: https://github.com/malfet
----
* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py
Minor changes were made to each test to run them inside Dynamo
One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
Fixes#154615
Enables ConvTranspose3D, since support seems to exist on both macOS 14 and 15.
For the half dtypes, the discrepancy between the CPU and GPU implementations is too large to conclude whether there is a bug in the implementation without a more rigorous study of what bounds on the expected error are reasonable. So they are left unsupported for now, and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.
Tests for ConvTranspose3D were enabled for the supported data types.
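A minimal usage sketch on a supported dtype (float32; fp16/bf16 now assert, per the note above):
```python
import torch

conv = torch.nn.ConvTranspose3d(4, 8, kernel_size=3, device="mps")
out = conv(torch.randn(1, 4, 5, 5, 5, device="mps"))
print(out.shape)  # torch.Size([1, 8, 7, 7, 7])
```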
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
Updates cpp-httplib to 0.20.1. This mostly brings OSS a bunch of CMake fixes, CXX compiler error fixes, and bugfixes from upstream. It's a header-only library, so the upgrade should be pretty straightforward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154825
Approved by: https://github.com/malfet
Which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`
`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch
x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
x[y] = 2
print(x)
except torch.AcceleratorError as e:
print("Exception was raised", e.args[0])
print("Captured error code is ", e.error_code)
```
which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0
Captured error code is 710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
The last update to this submodule was 3 years ago; the API is pretty stable, and this is a minor version release update. Part of a series of PRs to eradicate low required CMake versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154796
Approved by: https://github.com/jansel
Update NVTX3 submodule to 3.2.1.
* Mostly improved compiler support, Python support, and better CMake and C++ support.
* Also has a few new APIs to support fancy new features.
* This is a header-only library, so it should be an easy, non-invasive change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154797
Approved by: https://github.com/jansel
Previously, @Chillee wrote a script (https://github.com/pytorch/pytorch/pull/125811) to remove the inductor dependency from inductor-compiled triton kernels. We'd like to automate the process of obtaining the launch parameters.
Added functionality to the torch/utils/_get_clean_triton.py to automatically generate the launch_params file if it does not exist and the auto_generate_params flag is set to True. This includes running the input file in a subprocess with the appropriate environment variable. Updated the get_clean_triton function and the main script to support this new feature, allowing users to disable auto-generation via a command-line argument.
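A hypothetical programmatic invocation; the exact signature of `get_clean_triton` and the flag name are assumed from the prose above, not verified against the source:
```python
from torch.utils._get_clean_triton import get_clean_triton

# auto_generate_params=True (assumed kwarg) runs the input file in a
# subprocess with the launch-params environment variable set, producing
# the launch_params file if it does not already exist.
get_clean_triton(
    "torchinductor/model__0_forward_1.0/output_code.py",
    "triton_only_repro.py",
    auto_generate_params=True,
)
```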
# Test Plan
test embedding op in TritonBench
```
# generate inductor compiled triton kernels
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python run.py --op embedding --mode fwd --precision fp32 --metrics nsys_rep --only inductor_embedding --num-inputs 1 --input-id 11
# run the script to get rid of inductor dependency. By default, triton_only_repro.py is the output file name.
python ~/pytorch/torch/utils/_get_clean_triton.py ~/tritonbench/torch_compile_debug/run_2025_05_29_14_47_50_497790-pid_849274/torchinductor/model__0_forward_1.0/output_code.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154666
Approved by: https://github.com/davidberard98
Upstream triton has moved setup.py from python/ to the repository root. This PR keeps both layouts buildable by checking the location of setup.py and choosing the cwd of the build commands accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154635
Approved by: https://github.com/atalman
**Context**:
AMD triton kernels can be launched with special kwargs, like `waves_per_eu`. Triton configs with these kwargs look like this:
```
triton.Config({
    "BLOCK_SIZE": 64,
    "waves_per_eu": 2,
})
```
In comparison, NVIDIA's special kwargs are explicit parameters on the config, e.g. num_warps:
```
triton.Config(
    {"BLOCK_SIZE": 64},
    num_warps=4,
)
```
**Problem**: this causes custom triton kernels w/ PT2 to error out, because there's a kwarg in the triton.Config that doesn't appear in the kernel signature.
**Solution**: When splicing in the constexpr values into the arg list, ignore any values in the config kwargs list if they don't appear in the function signature.
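A minimal sketch of the filtering idea (not the actual inductor code; names here are illustrative):
```python
import inspect

def filter_config_kwargs(kernel_fn, config_kwargs):
    # Keep only config kwargs that appear in the kernel signature, so
    # AMD-only kwargs like waves_per_eu are dropped for kernels that
    # do not declare them.
    sig_params = set(inspect.signature(kernel_fn).parameters)
    return {k: v for k, v in config_kwargs.items() if k in sig_params}

def kernel(x_ptr, BLOCK_SIZE):  # stand-in for a @triton.jit kernel
    pass

print(filter_config_kwargs(kernel, {"BLOCK_SIZE": 64, "waves_per_eu": 2}))
# -> {'BLOCK_SIZE': 64}
```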
Differential Revision: [D75599629](https://our.internmc.facebook.com/intern/diff/D75599629/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D75599629/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154605
Approved by: https://github.com/njriasan
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict there are 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
If we avoid storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation.
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
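As a rough illustration of the change in 2) (a sketch only; the `.real_tensor` placement is per the description above, and the helper is hypothetical):
```python
import torch.fx as fx

def strip_intermediate_real_tensors(gm: fx.GraphModule) -> None:
    # Keep real tensors only on placeholder metadata; intermediates can
    # then be GC-ed as their variables go out of scope during fake prop.
    for node in gm.graph.nodes:
        if node.op != "placeholder":
            val = node.meta.get("val")
            if val is not None and hasattr(val, "real_tensor"):
                val.real_tensor = None
```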
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.
Test Plan:
NaN asserts properly generated in example gemm script:
vars = (buf1, primals_2, buf2, primals_1, )
for var in vars:
    if isinstance(var, torch.Tensor):
        assert not var.isnan().any().item()
        assert not var.isinf().any().item()
Differential Revision: D74691131
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
This is a follow-up to the reverted PR https://github.com/pytorch/pytorch/pull/148981, re-opened for visibility:
Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.
Motivation
Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.
Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.
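A hedged sketch of how this could be used (the JSON field name here is assumed, not the exact schema):
```python
import json
import os

with open("kernel.best_config") as f:
    best = json.load(f)

# "triton_cache_hash" is an assumed field name for the stored cache key.
cache_key = best["triton_cache_hash"]
print(os.path.expanduser(f"~/.triton/cache/{cache_key}"))  # matching cache dir
```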
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618
Approved by: https://github.com/jansel
If `TEST_TENSORBOARD == False`, then `DataType` is not defined or imported. However, it is used unconditionally when defining the test with `parametrize`, which leads to a NameError crashing the test execution on start.
Provide a dummy to make it syntactically correct. The tests will be skipped on start.
```
File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 885, in <module>
class TestTensorProtoSummary(BaseTestCase):
File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 889, in TestTensorProtoSummary
(torch.float16, DataType.DT_HALF),
^^^^^^^^
NameError: name 'DataType' is not defined
Got exit code 1, retrying...
test_tensorboard 1/1 failed! [Errno 2] No such file or directory: '/dev/shm/build/pytorch-v2.2.1/.pytest_cache/v/cache/stepcurrent/test_tensorboard_0_0dba8bc00bbe233f'
```
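A minimal sketch of such a guard (the dummy's members are assumed; the real test file derives `TEST_TENSORBOARD` from an import check):
```python
TEST_TENSORBOARD = False  # stand-in for the test file's import check

if TEST_TENSORBOARD:
    from tensorboard.compat.proto.types_pb2 import DataType
else:
    class DataType:  # dummy so parametrize is syntactically valid
        DT_HALF = None
        DT_FLOAT = None
        DT_BFLOAT16 = None
```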
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154709
Approved by: https://github.com/Skylion007
Use system NCCL by default. The correct NCCL version is already built into the Manylinux docker image.
Will follow up with a PR that detects whether the user has NCCL installed and enables USE_SYSTEM_NCCL by default in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
Summary: When setting the memory snapshot callback, we register and unregister callbacks for performance reasons. For ease of use, it makes sense to just remove all callbacks regardless of which flags are enabled. Enabling stays behind a feature flag; this just changes the disable path to ignore the flag.
Test Plan: Ran without any flags and saw all callbacks removed.
Differential Revision: D75636035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154664
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary:
This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.
There have been a lot of dynamic shape fixes done this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.
Test Plan: Reran the tests listed in T196779132 and they pass.
## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.
Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.
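A minimal sketch of the idea, assuming the `torch._guards` internals (`tracing` and `TracingContext`); not the exact code from the PR:
```python
from torch._guards import TracingContext, tracing

def compile_with_shape_env(gm, example_inputs, fake_mode, compile_fx):
    # The ShapeEnv rides along on fake_mode inside the TracingContext,
    # so FXGraphCache can reach it during the specialized compile.
    with tracing(TracingContext(fake_mode)):
        return compile_fx(gm, example_inputs)
```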
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward.
Here, I remove the inductor event, and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports, this will let internal UIs track each graph's contribution to the timeout.
```python
class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```
Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
## Summary
@laithsakka and I spoke offline about this one. TL;DR: we wanted the `guard_or_false`/`guard_or_true` behavior to also hold for Inductor. In that vein, we added two new APIs to size-vars, `guard_or_false` and `guard_or_true`, with the following semantics:
These APIs may add guards, but will never fail with data-dependent errors. They will try to evaluate the expression, possibly adding guards; if that fails due to data dependency, False or True is returned instead of hard-failing.
When to use this?
* Performance optimizations that warrant a recompilation.
* Taking the general path and adding a runtime check.
```
# Consider this branching:
if x == 0:
    return 1
else:
    return 10

# To make it data-dependent friendly, it can be written as follows:
if guard_or_false(x == 0):
    return 1
else:
    torch.check(x != 0)  # runtime check
    return 10
```
However, there is still one more API needed to make this example work: `torch.check`, which works with expressions. I will leave that to @laithsakka.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154672
Approved by: https://github.com/laithsakka
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.
There are really two parts to this work:
**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.
**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
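An illustrative sketch of the lazy dispatch in 3) (not the actual dynamo code; the predicate/compile interfaces are assumptions):
```python
class LazySpecializedDispatch:
    def __init__(self, specializations, compile_spec, generic_fn):
        # specializations: list of (matches_fn, spec) pairs, strictest first
        self.specializations = specializations
        self.compile_spec = compile_spec   # compiles one specialization
        self.generic_fn = generic_fn       # pre-compiled generic graph
        self.cache = {}

    def __call__(self, *args):
        for idx, (matches, spec) in enumerate(self.specializations):
            if matches(*args):
                if idx not in self.cache:  # compile lazily on first hit
                    self.cache[idx] = self.compile_spec(spec)
                return self.cache[idx](*args)
        return self.generic_fn(*args)      # fall back to the generic graph
```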
I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433
Summary:
When a C++ custom op returns an uninitialized tensor, it will be marked as None in Python. For this scenario, the user should mark the possibly uninitialized return as Tensor? in the custom op schema.
This diff adds `as_optional_tensor` type to export schema and the support for optional tensor in AOTI proxy executor.
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output
```
Differential Revision: D75262529
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154286
Approved by: https://github.com/desertfire
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.
There are really two parts to this work:
**The frontend changes (this PR):**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.
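A hedged usage sketch of the kwarg (the exact predicate form is an assumption):
```python
import torch
import torch._dynamo

x = torch.randn(8, 16)
# Trace once with a dynamic dim 0; the backend may additionally compile
# specialized graphs for sizes matching each predicate.
torch._dynamo.mark_dynamic(
    x, 0, specialize_on=[lambda s: s == 8, lambda s: s % 64 == 0]
)
```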
**The backend changes:**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153433
Approved by: https://github.com/zou3519
Summary: Add the conv padding ops in PyTorch; the corresponding PR in torchao is https://github.com/pytorch/ao/pull/2257
Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Differential Revision: D75494468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/torchtitan/issues/1185
It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
I'm not sure how worried we should be on the blast radius of this change. On the one hand:
(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)
(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
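A minimal sketch of the keying rule the fix implies (not the actual inductor code):
```python
def cache_key_items(config_dict, save_config_ignore):
    # Include private (leading-underscore) configs in the cache key too,
    # skipping only the configs explicitly listed as ignorable.
    return sorted(
        (k, v) for k, v in config_dict.items() if k not in save_config_ignore
    )
```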
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
ghstack dependencies: #153766
Summary: AMD streams are lazily initialized, and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while the stream is initializing, which would lead to an invalid configuration error.
Differential Revision: D75456460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, aka the watchdog and heartbeat-monitoring threads. Right now we launch these two threads per PG instance, i.e., if users create hundreds or thousands of PG or subPG instances, we end up with twice that many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned so many functionalities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.
We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see if we can make the monitoring thread a class. This PR takes the first step by making the monitoring thread a class. The next step is to extract the watchdog into a separate class as well, so that we know its dependencies.
What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, which is more relevant. This is totally fine since we fully rolled out EventCache, so watchdog hangs are rare now.
Today there are two major functions inside the heartbeat monitoring thread:
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. We check TCPStore every 30 sec to see if any watchdog timeout happened on other ranks; if so, we initiate a dump signal on the current rank as well (we do this only in the default PG).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
The `parent` in fallback_node_due_to_unsupported_type is a duplication of the `unsupported_output_tensor` logic; remove it. Tested that the tests in test_add_complex give the same codegen. This fixes an issue in mx that @drisspg was running into.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
Fix for https://github.com/pytorch/pytorch/issues/152425
Inductor specializes on whether or not a tensor is 16-byte aligned at the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.
If we infer alignment, then run with unaligned, and mutate the input, we need to reflect back the mutation to the input. This pr adds back that mutation.
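A conceptual sketch of the fix (the 16-byte alignment constant follows the specialization described above; the helper is hypothetical):
```python
def run_mutating_kernel(kernel, x):
    # Kernel was compiled assuming 16-byte-aligned inputs.
    if x.data_ptr() % 16 != 0:
        tmp = x.clone()   # fresh allocation, hence aligned
        kernel(tmp)
        x.copy_(tmp)      # reflect the mutation back to the caller's tensor
    else:
        kernel(x)
```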
We could also have been less aggressive about inferring alignment for mutated tensors, but that has a pretty big perf hit. See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.
In the future, we could consider a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions and handles un-alignment in a prologue/epilogue. But it's yet to be seen whether this is a huge issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
## Overview
This PR optimizes the FlexAttention CPP template with block sparsity.
Block sparsity is natively supported in FlexAttention's block mask structures, so following the kv-block logic from `kv_indice` and `full_kv_indice` is the straightforward way to add this optimization.
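An illustrative sketch of the traversal (field names lightly normalized from the prose; exact layouts are assumptions):
```python
def iter_kv_blocks(q_block, kv_num_blocks, kv_indices,
                   full_kv_num_blocks, full_kv_indices):
    # Partially-masked KV blocks: the mask must still be applied per element.
    for i in range(int(kv_num_blocks[q_block])):
        yield int(kv_indices[q_block, i]), True
    # Fully-unmasked KV blocks: mask evaluation can be skipped entirely.
    for i in range(int(full_kv_num_blocks[q_block])):
        yield int(full_kv_indices[q_block, i]), False
```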
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147196
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~
We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
Summary: as the title says, the clamp type promotion should take the min/max args into consideration as well.
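A short sketch of the behavior under test (the expected dtype follows standard type-promotion rules; this is an illustration, not the test itself):
```python
import torch

x = torch.randint(0, 10, (4,), dtype=torch.int32)
y = torch.clamp(x, max=torch.tensor(5.5))  # float max participates in promotion
print(y.dtype)  # torch.float32 once min/max are considered
```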
Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```
Differential Revision: D75490124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78