CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 RC testing, I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds.
This is related to:
https://github.com/pytorch/pytorch/issues/152756
Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262
Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
import torch
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
_load_global_deps()
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
_preload_cuda_deps(lib_folder, lib_name)
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
main()
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
run_cmd_or_die(f"docker exec -t {container_name} /exec")
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll
(cherry picked from commit 483f4e0db91166128ad8922d86dc7222338d4ecc)
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
fix cpp extension distributed warning spew (#162764)
With this change we only log the warning if we're running non-distributed code or if we're on rank 0. Unit testing that certain messages get printed only on certain ranks feels kind of janky, so the test plan is below instead.
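A minimal sketch of that gating, using hypothetical helper names (the actual logging call sites live in `torch/utils/cpp_extension.py`):
```python
import logging

import torch.distributed as dist

logger = logging.getLogger("torch.utils.cpp_extension")

def _warn_once_on_rank_zero(msg: str) -> None:
    # Hypothetical helper illustrating the gating described above: emit the message
    # only for non-distributed runs, or on rank 0 of an initialized process group.
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        logger.warning(msg)
```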
Test plan
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. The change also improves the debug message to explain what the actual behavior will be.
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`
Then we get
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
(cherry picked from commit f7e83219619a05934a344ca699c33ee69d5a3642)
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712)" (#163769)
NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs is exclusively to the buck/build files
(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712.
This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D81524507
Pulled By: Lucaskabela
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm
(cherry picked from commit 7d710403b003e44bf31d367673a05468e49df75d)
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
[Flex] Fix silent correctness w/ backpropping grads (#163677)
Fixes https://github.com/pytorch/pytorch/issues/162228
# Summary
The majority of our tests compile flex-attention in isolation. This means that for fake tensor propagation the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation and this check will pass. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate requires_grad-ness, which was happening here.
TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let joint-graph tracing handle creating the correct joint graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4
(cherry picked from commit e2ce79e4cce5327b71fcf366fad1133030563285)
Co-authored-by: drisspg <drisspguessous@gmail.com>
[a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`
i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
(cherry picked from commit bbf8aa43efe755b9c310347b3780962fca85bf9c)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314
In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**
This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose that shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets, and we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.
The B=None, H=None case works only by accident: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes silent errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
(cherry picked from commit 1a42656d6c43a9bb7eb90c511884ce451d29422f)
Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
Update Microsoft C++ Redistributable to the latest version (#161430)
Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430
Approved by: https://github.com/malfet
(cherry picked from commit 1330c638bef7fac64a42935b5a46ee32637ddd4d)
Co-authored-by: Saman Khatir <saman.khatir@amd.com>
[MPSHooks] Release pending command encoder (#164093)
Release the pending command encoder before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which results in the following runtime error
```
tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```
Added regression test to `test_mps_extension`
Please note that `torch::mps::get_command_buffer()` should be called with the dispatch queue held, both before and after this change, but many implementations skip that.
Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
(cherry picked from commit 8f32adc90a7fee83583c9ba89dbdfabb317e0452)
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597
- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.
### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)
Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Before: 0.160 ms
- After: 0.157 ms
### Vector 2-pass (q_len=1, k_len=4096)
- Before: 0.342 ms
- After: 0.339 ms
### Vector Fast Path (q_len=8, k_len=256)
- Before: 0.228 ms
- After: 0.231 ms
### Vector 2-pass (q_len=8, k_len=4096)
- Before: 0.432 ms
- After: 0.436 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
(cherry picked from commit 1c12d7416bc4f1cf0bc8a229e64169fc361b688e)
Co-authored-by: Vismai Khanderao <59114226+Vismai-Khanderao@users.noreply.github.com>
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.
Fixes https://github.com/pytorch/pytorch/issues/163801
Test Plan:
Check binary size increase against nightly or v2.9RC
Install the binary into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527
Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the Triton world does not know about the existence of this new cuda13 ptxas. So if a user assumes there is already a pytorch/bin/ptxas and deletes the ptxas shipped with Triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain that ptxas is not found (it won't know this new one is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
(cherry picked from commit 3b4ad4a17d69e2db495ecaf3bae8916282a4eb0d)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
(cherry picked from commit 54b38f3b46c33a1cc4e8f7894619358afcbd7c89)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Update the operator benchmarking, to benchmark using torch.compile (#161394)
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format to be used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
(cherry picked from commit af60398c3a057506363e028bf328843a755b4f24)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
[SymmMem] Barrier on team instead of world (#163298)
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just a re-arrangement of the tests to create a sub-group test class (no substantial change).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
(cherry picked from commit f8fb437197033c33ecc435cd5e1e6a5b2bc5bf69)
Co-authored-by: Ke Wen <kw2501@meta.com>
[SymmMem] Fix memory allocation hold-up (#162680)
Problem:
Without MemPool, it looks like the nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.
Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
(cherry picked from commit 7130b174e07dbc1a708934b18dede3d88e8f779f)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching the kernel correctly in the cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
(cherry picked from commit 9f8a311af09586ac4026d6a56fc7c4ac7acc62ed)
Co-authored-by: xinan.lin <xinan.lin@intel.com>
[CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel` leads the test to hang. Will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
(cherry picked from commit 96275dbf88372bb32a123c4ea918498128fbecb9)
Co-authored-by: Ke Wen <kw2501@meta.com>
[CD] Add statically linked windows libraries to exclude list (#163768)
Fixes: https://github.com/pytorch/pytorch/issues/159514
Seeing the following in the wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```
These files are around 800MB uncompressed and 109MB compressed; excluding them provides a ~50% size reduction for Windows CPU builds.
Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 98c4e35f14601909c113b4fd2857b6f0fb525316)
Co-authored-by: atalman <atalman@fb.com>
[CI] Move Windows build/tests to Python-3.10 (#162862)
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes lots of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, the `setup-win` action will create an environment called `py_tmp` with the specific Python version + intel-openmp (which is a hard runtime requirement, but for some reason is not packaged into the wheel nor marked as such)
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps
I think in the long run one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1, which currently pins Miniconda Python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies every time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[BE] Introduce `CONDA_ROOT_DIR` (#163341)
Which equal to `%CONDA_PARENT_DIR%/Miniconda3`, and replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341
Approved by: https://github.com/clee2000
ghstack dependencies: #163339
(cherry picked from commit a273475b01e912f402378a522bb9c4ed37e8413a)
Co-authored-by: Nikita Shulga <nshulga@meta.com>
[export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
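A minimal illustration of why this matters, using a simple sliced weight (a generic tensor/storage example, not the export serialization code path):
```python
import torch

weight = torch.randn(16, 64)
col = weight[:, 0]                                   # view sharing the full 16*64*4-byte buffer
print(col.untyped_storage().nbytes())                # 4096: the original storage size
print(col.contiguous().untyped_storage().nbytes())   # 64: .contiguous() materializes a new, smaller storage
```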
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
(cherry picked from commit 720a7b2887ca4efc8d63b32373182bc97918c76e)
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Update pytorch_sphinx_theme2 to latest hash (#163269)
The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD
(cherry picked from commit 68e75be86ab618bb6b1dc32b603a780ff6046262)
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
[SymmMem] Fix put_signal + wait_until hang (#163194)
The test used the wrong pointers to refer to remote addresses:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local addresses as input and perform the translation internally. Without the correct signal address, the peer would keep waiting, hence the hang.
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
(cherry picked from commit 80f8be9840c20c3efe1274266b52ab098f4d1030)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Graph Partition] improve custom op output alias (#163227)
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
"mylib::op1",
mutates_args=["x"],
schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
x = x + 1
return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel.
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
(cherry picked from commit 4967ad8baa724b8b1acc123698bb1265723feb87)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008)
Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to make sure to guard for device specialization. This aten op didn't work with vmap because we didn't register an explicit decomp rule saying we just skip the BatchedTensor and call it on the underlying tensor.
Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn
(cherry picked from commit e28983be76aa4651e3cb69dc3a4234d75038d938)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152)
1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
(cherry picked from commit 57a54a04b6eb78e0aa7d13b48e25fb8c0c49fd60)
Co-authored-by: Ke Wen <kw2501@meta.com>
Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240)
This makes gemma3 exportable on transformers=4.55.4
In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation.
Since there have been a number of requests lately to add various operators to the pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply the `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds a restriction that the custom autograd function needs to take in fx-able types.
3) The subclass constructor decorator is implemented similarly, so we just refactor it to use a similar implementation as this new decorator. Eventually we should delete the subclass constructor decorator.
4) Move some code in the subclass constructor decorator to exit early in non-export environments, which should shave off some inefficiency (around 1% according to @swolchok's benchmark)
Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758
Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
(cherry picked from commit 463fbc8ca0537e5635236190d2ca38ce6fcef831)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162566)
Fixes aarch64 Linux packaging, which hits the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
import torch
File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.
Current Error:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging has no effect.
This PR does following
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
It looks like after migrating from zipping the wheel to `wheel pack`, renaming the wheel is no longer necessary. Hence this removes the renaming and deletes the old file.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
Test Plan, Executed on local file:
```
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
to Tag: cp310-cp310-manylinux_2_28_aarch64
Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
(cherry picked from commit 3d32bb114bf0d5bd0193dc40f20253635dddf080)
Co-authored-by: atalman <atalman@fb.com>
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).
## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.
## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
Fixes #159590
This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).
This also handles cases such as #159590 where dynamo is disabled during tracing, by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these cases, and the list/tuple inputs with the cat op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
This PR is quite large in that it covers most of the rough edges in the new strict export flow:
1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code's source rather than the inner code location.
In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.
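A minimal illustration of the general mismatch, using a plain generator expression (the actual repro involves compiled code in torch.package; see the test below):
```python
import inspect

def outer():
    return (x * x for x in range(3))  # the genexpr compiles to a nested code object

gen_code = next(c for c in outer.__code__.co_consts if inspect.iscode(c))
print(outer.__code__.co_firstlineno)  # line of "def outer" -- matches the toplevel enclosing source
print(gen_code.co_firstlineno)        # line of the generator expression -- the inner code location
print(inspect.getsource(outer))       # returns the enclosing function's source
```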
Test Plan:
test_package.py -k test_code_with_generator
Rollback Plan:
Differential Revision: D81929751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(), but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function computing contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When people call is_contiguous we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false().
One path that was not handled well was this one:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
at::MemoryFormat memory_format) const {
if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
this, memory_format);
}
return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format).
This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is doing sym_is_contiguous.
So I had to define it for the PyInterpreter, and then override it for nested tensors.
Approved by: https://github.com/ezyang
Test Plan:
contbuild & OSS CI, see e444cd24d4
Rollback Plan:
Differential Revision: D80435179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
# Summary
### Update
API
```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None


out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```
Returns the max post mod scores from flex attention.
Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups from the additional args.
Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a kwarg-only argument for now.
Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.
We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward, or find the max_score but only apply it to the first occurrence if there are multiple equivalent scores (need to check whether that's what we define for the vanilla max op in torch).
For now no grad, we can re-visit if needed.
## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors. If returning max is disabled, we should probably just create a tensor of size zero so that we don't slow down the hot path.
```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘
🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │
│ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
Summary: This PR introduces shape guards to export. Previously only value ranges, equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program.
Test Plan:
updated several tests
Rollback Plan:
Differential Revision: D80713603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
Summary:
A tool to track events in graph split, specifically how nodes end up in the acc or cpu subgraphs.
Usage: use env var to specify a mode and necessary arguments.
FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspect in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recusrively and dump to DUMP_PREFIX_nodex.txt
"4:: In addition to events dump, track all nodes with more than 1 event recusrively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
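A hedged usage sketch based on the env vars documented above (the dump path is a placeholder):
```python
import os

# Enable the tracker before running the acc/cpu graph split.
os.environ["FX_NET_ACC_SPLITTER_TRACKER_MODE"] = "2"                 # dump all events
os.environ["FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH"] = "/tmp/split"   # override the dump prefix
# ... run the splitter as usual; events are written to /tmp/split_all.txt
```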
Test Plan: New unit test
Reviewed By: georgiaphillips
Differential Revision: D79203595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
Fixes a few bugs introduced in CUDNN 1.11 which affect all our CUDA 13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header-only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).
Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
LOAF previously could skip these fusion opportunities and cause some tests to fail.
Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)
Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.
Removing the alpha and beta fixes the issue.
Thanks @ngimel for figuring out the root cause.
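For reference, the general gemm semantics behind the fix can be seen with `torch.addmm` (a hedged illustration, not the grouped-gemm kernel touched by this PR): out = beta * C + alpha * (A @ B), so beta=1 folds whatever the output buffer already holds into the result.
```python
import torch

A, B = torch.randn(4, 8), torch.randn(8, 4)
C = torch.empty(4, 4)                             # uninitialized output buffer
bad = torch.addmm(C, A, B, beta=1.0, alpha=1.0)   # whatever garbage C held leaks into the result
good = torch.addmm(C, A, B, beta=0.0, alpha=1.0)  # beta=0 ignores C, so this equals A @ B
print(torch.allclose(good, A @ B))                # True
```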
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
Fixes static cuda launcher after https://github.com/triton-lang/triton/pull/7866.
Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162309
Approved by: https://github.com/aakhundov
ghstack dependencies: #162244
Follow-up to #161768.
Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished.
In #161768 we updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()` and is not pickleable.
It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case - the MultiKernel case uses the hash lock.
To do this, we add `restore_after_unpickle()`, which will restore the fields (or, if the old fields are not provided, initialize just the hash_lock).
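A generic sketch of the strip/restore pattern described above, using illustrative names (not the actual Inductor/Triton classes):
```python
import pickle
import threading

class KernelLike:
    def __init__(self):
        self._hash_lock = threading.RLock()  # an RLock is not pickleable

def prepare_for_pickle(kernel):
    # Strip the unpicklable field before crossing the ProcessPool boundary,
    # returning it so the caller can restore it later.
    old_lock, kernel._hash_lock = kernel._hash_lock, None
    return old_lock

def restore_after_unpickle(kernel, old_lock=None):
    # Restore the stripped field, or re-initialize a fresh lock if none was kept.
    kernel._hash_lock = old_lock if old_lock is not None else threading.RLock()

k = KernelLike()
prepare_for_pickle(k)
k2 = pickle.loads(pickle.dumps(k))
restore_after_unpickle(k2)
```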
Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162244
Approved by: https://github.com/atalman
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.
In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module.
Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162169
Approved by: https://github.com/zhxchen17
## Summary
This PR improves typing in ONNX-related modules by replacing TypeVar bound to Callable[..., Any] with ParamSpec to preserve parameter types and avoid type erasure in decorator functions.
## Changes
- `torch/onnx/_internal/exporter/_flags.py`: Replace TCallable TypeVar with ParamSpec
- `torch/onnx/ops/_impl.py`: Replace _T TypeVar with ParamSpec for _onnx_op decorator
- `torch/onnx/_internal/exporter/_torchlib/_torchlib_registry.py`: Replace _T TypeVar with ParamSpec
## Motivation
The previous implementation used TypeVar bound to Callable which erased parameter type information to Any. ParamSpec preserves the exact parameter types and return types, providing better type safety and IDE support.
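A minimal sketch of the pattern (not the exact decorators in these modules), requiring Python 3.10+ for `typing.ParamSpec`:
```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def set_flag(fn: Callable[P, R]) -> Callable[P, R]:
    # With ParamSpec, the wrapper keeps fn's exact parameter and return types,
    # instead of collapsing them to Callable[..., Any].
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```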
## Testing
- Verified all changes compile and import correctly
- Created comprehensive test suite to validate ParamSpec functionality
- No linting errors introduced
- Maintains backward compatibility
Fixes #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162332
Approved by: https://github.com/Skylion007
This uses the same approach as building triton wheel where we publish a nightly wheel for vLLM whenever its pinned commit is updated. The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes on the vLLM Dockerfile used by lumen_cli
1. `pytorch/manylinux2_28-builder` is RedHat instead of Debian-based, so no apt-get
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist`
4. In vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has been landed on vLLM
We need to prepare 3 wheels for vLLM, xformers, and flashinfer-python. And I rename them in the same convention as PyTorch nightlies `MAJOR.MINOR.PATCH.devYYYYMMDD` so that vLLM nightlies will work with torch nightlies on the same date.
### Usage
* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
vllm==1.0.0.dev20250903 \
xformers==0.0.33.dev20250903 \
flashinfer_python==0.2.14.dev20250903 \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
Summary:
A demo for creating AOTI delegate for NativeRT in OSS.
- It supports full graph lowering only.
- It leverages `executorch_call_delegate` HOP but doesn't rely on `executorch`.
- The delegate graph is obtained by tracing a `LoweredBackendModule` whose forward function calls `executorch_call_delegate`.
- The main difference between `executorch_call_delegate` and `aoti_call_delegate` is that the delegate graph from `executorch_call_delegate` doesn't have weights lifted as inputs.
- original_ep and delegate_ep are treated as flat EP dictionary and there is no nested structure.
- The naming contract is enforced by `model_name` and `backend_id`
Test Plan:
CI
Rollback Plan:
Differential Revision: D81641157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162285
Approved by: https://github.com/dolpm
I am unable to write a test that would fail here. The reason is that when we do _dynamo.disable(fn) in the compiled frame, the id of the disabled function changes, but currently we guard on the original function `fn`, whose id is not changing. This PR still guards on `fn.__code__`, just to be more precise.
Thanks to @thenumberouscode for pointing this out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162247
Approved by: https://github.com/StrongerXi, https://github.com/jansel
Summary:
If I have an EP that's exported on CPU and want to AOTI compile it for CUDA, I need to use `move_to_device_pass`.
But in `torch._inductor.aoti_compile_and_package()`, it directly uses the `example_inputs` attached to the EP, so we should move the example inputs as well if applicable.
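A hedged sketch of the intended flow (the toy module is a placeholder; `move_to_device_pass` lives in `torch.export.passes`):
```python
import torch
from torch.export import export
from torch.export.passes import move_to_device_pass

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = export(M(), (torch.randn(4),))        # EP exported on CPU
ep_cuda = move_to_device_pass(ep, "cuda")  # retarget the EP to CUDA
# With this change, the example inputs attached to the EP end up on the target device as well,
# so torch._inductor.aoti_compile_and_package(ep_cuda) compiles against CUDA inputs.
```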
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_move_device_example_inputs
Rollback Plan:
Differential Revision: D81812366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162301
Approved by: https://github.com/angelayi
Skipping renaming causes wrong dependencies when mutations are involved.
Test:
CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap
Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
- Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
- Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
- Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
- Bump FBGEMM third party submodule to include:
- https://github.com/pytorch/FBGEMM/pull/4816
- https://github.com/pytorch/FBGEMM/pull/4820
- https://github.com/pytorch/FBGEMM/pull/4821
- https://github.com/pytorch/FBGEMM/pull/4823
#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`
## Test plan
#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)
#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...
test/test_matmul_cuda.py ......... [100%]
============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.
# Test plan
Added a CI test registering a custom FX backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
When running a bazel build, we (Google) ran into the error below.
The `-Wctad-maybe-unsupported` warning is raised to an error and breaks the build in certain cases.
So we propose suppressing the warning to make the bazel build smoother.
This is the error message we got:
```
c10/util/IntrusiveList.h:166:12: error: 'std::reverse_iterator' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
166 | return std::reverse_iterator{end()};
| ^
c10/test/util/IntrusiveList_test.cpp:24:18: note: in instantiation of member function 'c10::IntrusiveList<(anonymous namespace)::ListItem>::rbegin' requested here
24 | auto it = c1.rbegin();
| ^
c10/test/util/IntrusiveList_test.cpp:43:5: note: in instantiation of function template specialization '(anonymous namespace)::check_containers_equal<(anonymous namespace)::ListItem>' requested here
43 | check_containers_equal(l, v);
| ^
libcxx/include/__iterator/reverse_iterator.h:51:7: note: add a deduction guide to suppress this warning
51 | class reverse_iterator
| ^
1 error generated.
```
@haifeng-jin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162223
Approved by: https://github.com/ezyang
Fix the `DeviceMesh._flatten` docstring example of use. Alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`.
(I verified the fix using the `gloo` backend)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162277
Approved by: https://github.com/ezyang
# Summary
### Update
API
```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None


out_only = flex_attention(query, key, value, score_mod)

out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)

out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```
Returns the max post mod scores from flex attention.
Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: whenever we need to add more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.
Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a kwarg-only argument for now.
Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.
We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward, or find the max_score and apply it only to the first occurrence when there are multiple equivalent scores (need to check whether that's what we define for the vanilla max op in torch).
For now no grad, we can re-visit if needed.
## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors; if returning max is disabled, we should probably just create a tensor of size zero so that we don't slow down the hot path.
```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘
🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │
│ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.
This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.
In order to accomplish this I move the DTensor implicit replication flag
to C++ (most batching rules are in C++).
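A minimal repro sketch of the now-working pattern (assumes a multi-GPU launch via torchrun with NCCL; shapes are illustrative):
```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

B = 2  # vmap batch dimension shared by both inputs
weight = distribute_tensor(torch.randn(B, 10, 4, device="cuda"), mesh, [Replicate()])
indices = distribute_tensor(torch.randint(0, 10, (B, 5), device="cuda"), mesh, [Replicate()])

# Previously errored with a mixed DTensor/Tensor message from the internal at::arange.
out = torch.vmap(F.embedding)(indices, weight)
```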
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
Fixes #156632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
Adding a test that is closer to real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.
We still have to skip the AOTI and dynamic case due to accuracy issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782
Approved by: https://github.com/mlazos
# why
- gather everything up to make choices, without running
potentially expensive generators
- enables overrides where we toss the entire list of configs
from inductor, without having to enumerate it (expensive)
# what
- add a holding class that just gets all the components necessary
to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
the scene
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346
# why
- heuristics providers now decide whether to add choices (and which ones)
in the max-autotune case
- enables an eventual override point to gracefully fallback to the
standard behavior
# what
- max-autotune is determined inside V.choices.get_mm_configs
because it's mm only right now, we can just do
`config.max_autotune or config.max_autotune_gemm`
a TODO indicates that this can change in the future when this
expands to more templates
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343
# why
- central point to analyze and override all generated choices
# what
- add a pseudo heuristic for aten that just yields a single, empty
kwargs
- add a pseudo heuristic with the bias_addmm logic for it
- add an addmm specific heuristic that yields a single choice, but
also expands it with alpha and beta kwargs
- replace all the aten.bind calls with V.choices.get_mm_configs
using the now matching API for aten
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341
# why
- to have a central registry of templates/externkernelchoice
to match them to heuristics etc, they need unique names
- mm is both the triton template name and the aten_mm name
# what
- add a uid() to KernelTemplate/ExternKernelChoice that returns name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"
This id is just used to find template heuristics, so it has no other
impact
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075, #161340
# why
- a step towards a unified interface for all choices, where any
adjustment to nodes (e.g. unsqueezing) happens as part of
choice specific preprocessing, behind a common point
# what
- move the unsqueeze logic for triton nodes for scaled_mm inside
the new hookup for adjusting the kernel inputs for template
heuristics
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale"
```
Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075
Summary:
Weight vector needs to be upcasted since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13
We will use FP32 for the scale vector multiplication and convert to the target dtype.
Upcasting helps with the following:
1. **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2. **Numerical stability**: `float32` provides more precision during intermediate calculations
3. **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions
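A minimal sketch of the upcast-then-downcast pattern described above (names are illustrative, not the PR's code):
```python
import torch

def apply_scale(weight: torch.Tensor, scale: torch.Tensor, target_dtype: torch.dtype) -> torch.Tensor:
    # Multiply in float32 (full CPU kernel coverage, extra precision), then
    # cast down to the target dtype, e.g. torch.float8_e4m3fn.
    return (weight.to(torch.float32) * scale.to(torch.float32)).to(target_dtype)

w = torch.randn(16, dtype=torch.bfloat16)
s = torch.full((16,), 0.5)
print(apply_scale(w, s, torch.float8_e4m3fn).dtype)  # torch.float8_e4m3fn
```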
Test Plan:
UTs
Rollback Plan:
Differential Revision: D81711093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
Avoid merges from the extra PGO key if the same source has a different rank. Unlikely to happen (needs a code hash match and the source variable type to change), but being safe.
Differential Revision: D81299840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097
Approved by: https://github.com/bobrenjc93
Summary: When running coordinate descent tuning, the logging is difficult to parse if the results are parallelized at all. This change includes the kernel name in each step so post-processing can unify the results, even if run in parallel.
Test Plan:
NFC. Just a logging change.
Rollback Plan:
Differential Revision: D80942794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409
Approved by: https://github.com/PaulZhang12
Summary:
The binary that torch is running inside of can be larger than needed, and in certain
situations this can cause a loss of memory.
Test Plan:
We've manually run tests via
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0
make mc8-train-publish-cint-datafm-toy -C
minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out
```
and overriding the binary used to be the built fbpkg in /packages.
We've also kicked off manual runs at
```
fire-feid-20250903-1051-ae8c6827
```
Which do show the binary running - https://fburl.com/scuba/procprint/e6lwv32m
Rollback Plan:
steps:
- jk.update:
jk: pytorch/compiler:subproc_worker_binary
constant_bool: null
consistent_pass_rate: null
fractional_host_rollout: null
sampling_rate: null
- manual.note:
content: ''
Differential Revision: D81616624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093
Approved by: https://github.com/masnesral
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.
From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it fails to resolve, since tools/autograd is a package from blaze's perspective. Any file under a blaze package needs a proper blaze target to be accessed.
"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD
Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`
If you have something like `AtenTensorHandle buf_handle` and you have allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory, because even when the program ends there is still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```
Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`
Differential Revision: D81640339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test.
You can see more of a description of the problem in the code comments. But the TLDR is that:
* When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graphs becomes N AOT/inductor artifacts
* We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per **AOT** graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i
* This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during **backward compilation**. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have started compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donating and writing into incorrect input buffers
The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745
Approved by: https://github.com/ezyang, https://github.com/zou3519
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4775
Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error.
```
Undefined symbols for architecture arm64:
"fbgemm::FindMinMax(float const*, float*, float*, long long)", referenced from:
at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o
at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o
PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o
at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o
ld: symbol(s) not found for architecture arm64
```
This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc.
Test Plan:
With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled.
Rollback Plan:
Reviewed By: q10
Differential Revision: D81052327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527
Approved by: https://github.com/q10
…h.is_complex.
The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.
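A short example of the kind added (output comments reflect default dtypes):
```python
import torch

print(torch.is_complex(torch.tensor([1.0, 2.0])))                 # False (float32)
print(torch.is_complex(torch.tensor([1 + 2j, 3 + 4j])))           # True  (complex64)
print(torch.is_complex(torch.tensor([1, 2], dtype=torch.int64)))  # False (int64)
```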
Fixes #161859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951
Approved by: https://github.com/zou3519
Fixes#161080
torch.export.export fails with `TypeError: expand() got an unexpected keyword argument 'implicit'` when calling `torch.expand_copy(..., implicit=True)`. This happened because `expand_copy = _make_copy_from_view(aten.expand)` registers `aten.expand` as the decomposition path for `aten.expand_copy`, which doesn't accept the `implicit` argument.
I have added an explicit decomposition for `aten.expand_copy` in `torch/_decomp/decompositions.py` that ignores the `implicit` argument, and a simple unit test to demonstrate the bug being fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
Summary:
Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.
Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support
Test Plan:
```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
Summary:
Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
`ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU
At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
future PRs
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
Summary:
Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).
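For intuition, a naive 2D-3D grouped mm in Python looks roughly like this (illustrative only, not the actual fallback code, which may also use bmm):
```python
import torch

def naive_grouped_mm_2d_3d(mat_a: torch.Tensor, mat_b: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    # mat_a: [total_M, K]; mat_b: [G, K, N]; offs: cumulative row offsets into mat_a, one per group
    outs, start = [], 0
    for g, end in enumerate(offs.tolist()):
        outs.append(mat_a[start:end] @ mat_b[g])
        start = end
    return torch.cat(outs, dim=0)

a, b = torch.randn(6, 4), torch.randn(2, 4, 3)
print(naive_grouped_mm_2d_3d(a, b, torch.tensor([2, 6])).shape)  # torch.Size([6, 3])
```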
For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
## Introduction
During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.
This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
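A sketch of opting in, assuming the flag is exposed through the standard CUDA allocator settings string like other caching-allocator options (and a build that contains this PR):
```python
import os

# Must be set before the CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch  # noqa: E402
```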
## Terms
* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.
## When can we reuse a block during capture?
### Strong Rule (Graph-Wide Safety)
This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.
> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.
Why it's safe:
This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.
### Per-stream Rule (A Practical Optimization)
The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.
In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.
> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.
In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
## Implementation
* On `free(block)` during capture
* For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
* If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
* Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
* Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
* For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
* If yes, hand the block to S for immediate reuse within the same capture.
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
* Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.
## Examples (2 streams)
<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />
* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.
## Edge Case: Freeing after a join
Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).
In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that its actual last use may have occurred much earlier, long before the join, so we must wait for the subsequent join before the block can be reused.
## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph
Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747
Approved by: https://github.com/mlazos
ghstack dependencies: #161739
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](83c5a5a551), includes:
- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062
Approved by: https://github.com/EikanWang
`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites that with a single path causing dependencies such as protobuf or Abseil not being found.
Instead prepend the path to the existing value.
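A sketch of the prepend pattern (the `extra_path` value is hypothetical):
```python
import os

extra_path = "/path/to/libtorch/cmake"  # hypothetical
existing = os.environ.get("CMAKE_PREFIX_PATH", "")
os.environ["CMAKE_PREFIX_PATH"] = os.pathsep.join(p for p in (extra_path, existing) if p)
```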
This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
> self.assertTrue(so_path.exists())
> AssertionError: False is not true
Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161907
Approved by: https://github.com/Skylion007
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error.
This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). It also adds an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it.
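A minimal sketch of enabling it via the environment variable named above (set before inductor's config is loaded):
```python
import os

os.environ["TORCHINDUCTOR_CUDAGRAPH_OR_ERROR"] = "1"

import torch  # noqa: E402
```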
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.
I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb
The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```
Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424
Approved by: https://github.com/malfet
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We could enable Intel GPU with the following methods, trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backends, for example:
added xpu to Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary:
When we have a user defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.
When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements, instead of skipping it.
To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput.
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```
Differential Revision: D81530083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: This is a reland of D80285441, fixed the unit test.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
will succeed after this diff.
Rollback Plan:
Differential Revision: D80971224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
Print out the amp target dtype so custom backends can more easily find the expected dtype during integration.
## Test Result
### Before
```python
In [1]: import torch
...: import torch_openreg
...:
...: a = torch.randn(3, 4)
...: b = torch.randn(4, 2)
...: with torch.autocast("openreg", dtype=torch.float16):
...: torch.mm(a, b)
...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype is not supported. Disabling autocast.
openreg Autocast only supports dtypes of torch.float32 currently.
warnings.warn(error_message
```
### After
```python
In [1]: import torch
...: import torch_openreg
...:
...: a = torch.randn(3, 4)
...: b = torch.randn(4, 2)
...: with torch.autocast("openreg", dtype=torch.float16):
...: torch.mm(a, b)
...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype torch.float16 is not supported. Disabling autocast.
openreg Autocast only supports dtypes of torch.float32 currently.
warnings.warn(error_message)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162037
Approved by: https://github.com/zou3519
In this PR, we port test/distributed/tensor test files to Intel GPU.
We could enable Intel GPU with the following methods, trying our best to keep the original code style:
- Use torch.accelerator for general GPU support
- Skip cases with known issues when running on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161604
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Previously, gh-83069 introduced a normalization step in the toDLPack converter that changes the strides to 1 when shape[i] == 1.
This step, however, calls as_strided during toDLPack and can slow toDLPack down by about 3x, raising the overhead of PyTorch's DLPack conversion to around 0.6 us per call from under 0.2 us.
This PR updates the logic by adding a need_normalize_strides check to first confirm whether the strides normalization is necessary. In the most common case, when the tensor is contiguous, such normalization is not necessary.
We confirmed that this additional check brings the speed of toDLPack back below 0.2 us and can significantly speed up eager-mode integration of DLPack with PyTorch.
If we detect that normalization is needed, the older path is invoked.
Fixes #162113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162111
Approved by: https://github.com/msaroufim
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and functions well as an alternative to `split-k` (which currently does not support AMD) for the transposed case on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
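Conceptually, the decomposition does something like the following (a sketch of the idea, not the inductor implementation), and autotuning then decides whether it beats the strided kernels:
```python
import torch

def contiguous_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Copy the non-contiguous operand to a contiguous layout before the matmul.
    return torch.mm(a, b.contiguous())

def contiguous_addmm(bias: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.addmm(bias, a, b.contiguous())
```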
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous subgraph decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and shows a very large improvement on low-`M`, low-`N`, high-`K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
This only covers user outputs, which is what we want. I spoke to @zhxchen17, though, and it seems like nativeRT might have some bugs in propagating updates to things like input mutation or buffer mutation. Something to take a look at in a follow-up.
Also I have no idea where the nativeRT tests are. Any pointers @zhxchen17 @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161723
Approved by: https://github.com/zhxchen17
Renaming `set_fullgraph` to `error_on_graph_break` for now. There are no semantic differences yet. In a followup PR, we will introduce a new `torch.compile` option `error_on_graph_break` that has lower priority than `fullgraph` so that `fullgraph` really returns 1 graph.
I could keep `set_fullgraph` as a deprecated alias for `error_on_graph_break` for now, but I'm hoping that won't be necessary since it's still private API (there are no internal callsites yet, and there are no significant OSS callsites yet).
cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos @guilhermeleobas @xmfan as primary users for `set_fullgraph`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161739
Approved by: https://github.com/xmfan, https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/mlazos
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint (a sketch of the dequantization idea follows below). The following capabilities are introduced:
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency.
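An illustrative sketch of block-wise dequantization (names and tensor layout are assumptions, not the QuantizedHuggingFaceReader API):
```python
import torch

def dequantize_blockwise(q: torch.Tensor, scales: torch.Tensor, block_size: int,
                         target_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # q: [N] quantized values; scales: [N // block_size] per-block scales.
    blocks = q.to(torch.float32).reshape(-1, block_size)
    return (blocks * scales.reshape(-1, 1)).reshape(-1).to(target_dtype)

q = torch.randint(-8, 8, (16,), dtype=torch.int8)
print(dequantize_blockwise(q, torch.tensor([0.5, 0.25]), block_size=8).dtype)  # torch.bfloat16
```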
Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
Summary:
[reland]
Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D81334984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794
Approved by: https://github.com/zhxchen17
pytest test/dynamo/test_aot_compile.py -k test_aot_compile_graph_break_error_fmt
before
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 717, in aot_compile
return aot_compile_fullgraph(
^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/aot_compile.py", line 132, in aot_compile_fullgraph
capture_output = convert_frame.fullgraph_capture(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 947, in fullgraph_capture
dynamo_output = compile_frame(
^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 1020, in compile_frame
bytecode, tracer_output = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/bytecode_transformation.py", line 1592, in transform_code_object
tracer_output = transformations(instructions, code_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 992, in transform
tracer_output = trace_frame(
^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 312, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 821, in trace_frame
run_tracer()
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 803, in run_tracer
tracer.run()
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1472, in run
while self.step():
^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1342, in step
self.dispatch_table[inst.opcode](self, inst)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 902, in wrapper
return inner_fn(self, inst)
^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3364, in CALL
self._call(inst)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3358, in _call
self.call_function(fn, args, kwargs)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1260, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/variables/functions.py", line 1513, in call_function
unimplemented_v2(
File "/data/users/$USER/pytorch/torch/_dynamo/exc.py", line 596, in unimplemented_v2
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
```
after
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 737, in aot_compile
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
from user code:
File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
consistent w/ std torch.compile
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 16, in <module>
res = compiled(*example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 850, in compile_wrapper
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
from user code:
File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162005
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Reason:
ROCm binary builds should not block the viable/strict upgrade. They are queuing/getting canceled, so viable/strict is 1.2 days old.
Tested by mangling the workflow file to get to the actual call of the python script `python ../test-infra/tools/scripts/fetch_latest_green_commit.py --required-checks '["pull", "trunk", "lint", "^linux-binary-manywheel$", "^linux-binary-libtorch-release$", "linux-aarch64"]' --viable-strict-branch viable/strict --main-branch master`, which I then ran locally where I have credentials. It returned d64718503728001a1e78168fd7f2d4ff23e57285 which is green. Without this change, it returns 5e5870e858f60ff4bf87d03f3592097e934a9580, which is pretty old
The other solution would have been to mark it as unstable I think
Side note, why is it master and how is it working like that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162100
Approved by: https://github.com/huydhn
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:
* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
- AITER ASM kernels deliver over 500TFLOPS training performance. See
[AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
details.
* Now returns a natural-base `logsumexp` tensor, matching CUDA's behavior
- PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`
The build system changes drastically along with new packaging scheme of
AOTriton 0.11
* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
`PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
matches the ROCM in the build environment. For PyTorch builds with ROCm
versions not listed in the file, the build process will build AOTriton
runtime without GPU images from source
- This avoids any further ABI breaks like ROCM 6.4 -> 7.0
- recursive git clone is disabled since building AOTriton runtime does not
require submodules.
Bug fixes:
* Fix a kernel bug introduced when implementing SWA
Known Problems:
* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
due to accuracy issues. Triton compiler fixes are needed to restore the
support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
This issue is under investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Summary:
The saving (serialization) part of PT2 archive weight refactoring.
The loading (deserialization) part has been landed in D80035490.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80970931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161520
Approved by: https://github.com/SherlockNoMad
**Summary:** Previously, we called `assert_and_get_unqiue_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we should just call it once at the beginning and reuse the same model device across all the nodes.
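As a rough illustration of the caching pattern (a minimal sketch with a hypothetical helper, not the actual prepare/convert code):
```python
# Minimal sketch of the idea with a hypothetical helper: derive the model
# device once up front, then reuse it for every node instead of re-querying.
import torch

def get_unique_device(model: torch.nn.Module) -> torch.device:
    devices = {p.device for p in model.parameters()} | {b.device for b in model.buffers()}
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return devices.pop() if devices else torch.device("cpu")

model = torch.nn.Linear(4, 4)
model_device = get_unique_device(model)  # called once
# ... every node-level transformation now reuses `model_device` ...
```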
**Test Plan:**
python test/test_quantization.py -k TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901
Approved by: https://github.com/jerryzh168
Summary:
Our strategy for detecting fake tensor leakage in non-strict export for the outside scope (side effects happening outside of model.forward) is:
1. We do gc.collect() before export and get the alive fake tensors
2. We dump the proxy to fake tensor map from make_fx tracer
3. We query gc again to get alive fake tensors
4. We take the delta between (1) and (3)
5. Filter out fake tensors that are:
1. Associated with `TrackedFake` (input tracking thing in symbolic_shapes)
2. Associated with `gm.meta`
6. Do ID match with the proxies and emit their stacktraces.
We rely on (https://github.com/pytorch/pytorch/pull/159923) for other sources of leakages such as:
1. We failed to proxy an operator (like param.data)
2. We cache some tensor in model.forward (https://github.com/pytorch/pytorch/issues/155114)
In general, we notice that `gc.collect()` and querying gc for live objects are fairly slow, so we only turn on this feature under an env variable. We should document in export's public-facing docs that users who run into weird fake tensor errors should look into turning on this env variable for further analysis.
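A minimal sketch of the gc-based bookkeeping in steps (1)-(4) above (the real implementation also filters `TrackedFake`/`gm.meta` references and ID-matches survivors back to proxies):
```python
# Minimal sketch of steps (1)-(4): snapshot alive FakeTensors before and after
# tracing and take the delta; the real filtering and proxy matching is more involved.
import gc
from torch._subclasses.fake_tensor import FakeTensor

def alive_fake_tensor_ids() -> set[int]:
    gc.collect()
    return {id(obj) for obj in gc.get_objects() if isinstance(obj, FakeTensor)}

before = alive_fake_tensor_ids()
# ... run non-strict export / make_fx tracing here ...
leaked = alive_fake_tensor_ids() - before  # candidates for leakage analysis
```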
Test Plan:
Test plan
Rollback Plan:
Differential Revision: D80003204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160456
Approved by: https://github.com/pianpwk
### Motivation
* MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />
* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa
* Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR.
### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
# why
- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
for codegen and heuristics
- unified place where we can just adjust them for the template
# what
- inside get_mm_configs, return not the passed in kernel inputs,
but allow the template heuristic to adjust them if necessary
- the default implementation right now just passes them back
this diff just adds the functionality, but does not exercise it
other than the default (passthrough)
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and functions well as an alternative to `split-k` on AMD (transposed only), since `split-k` is not currently supported on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
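Conceptually, the decomposition autotunes the following kind of rewrite (a hand-written sketch; the real subgraph is generated by Inductor, not written by hand):
```python
# Hand-written sketch of the rewrite being autotuned: copy B to a contiguous
# layout first, trading an extra copy for a faster contiguous GEMM at large K.
import torch

def contiguous_addmm(bias, a, b):
    return torch.addmm(bias, a, b.contiguous())

a = torch.randn(1024, 178176, device="cuda", dtype=torch.float16)
b = torch.randn(6144, 178176, device="cuda", dtype=torch.float16).t()  # stride [1, 178176]
bias = torch.randn(6144, device="cuda", dtype=torch.float16)
out = contiguous_addmm(bias, a, b)  # same result as torch.addmm(bias, a, b)
```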
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous subgraph decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low `M`, low `N`, and high `K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
Summary:
Inductor has the following configurations:
config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold
In the static-shape case, enabling these three options makes Inductor generate code for flexible-layout tensors that pads every stride dimension up to a multiple of config.padding_alignment_bytes for strides above config.padding_stride_threshold. When dynamic shapes are enabled, no padding is done today.
This PR introduces the following configuration, which allows the user to request padded strides even for dynamic-shape operations. A separate flag is used so we don't break the previous behaviour of not padding dynamic-shape use cases. config.padding_stride_threshold does not apply here since the stride values are dynamic.
config.pad_dynamic_shapes
In addition, a new mode "python_slow" has been added to the launch-grid calculation, which implements the usual ceildiv behaviour for integer division. This is done to prevent test regressions and to make wrapper_fxir codegen more generic.
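A hedged usage sketch (the first two knobs are the existing Inductor configs listed above; `pad_dynamic_shapes` is the new flag, and its defaults/interactions are assumed here):
```python
# Hedged sketch: opt dynamic-shape compilations into stride padding as well.
# `pad_dynamic_shapes` is the new flag; the other knobs are existing configs.
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.padding_alignment_bytes = 128
inductor_config.pad_dynamic_shapes = True

@torch.compile(dynamic=True)
def f(x):
    return (x @ x.t()).relu()

f(torch.randn(64, 96, device="cuda"))
```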
Test Plan:
CI
Rollback Plan:
Differential Revision: D80468808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug for Intel GPU.
We enable Intel GPU with the following methods while trying our best to keep the original code style:
1. Use torch.accelerator for general gpu
2. Skip the cases that have known issues when running on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161261
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse.
Reviewed By: davidberard98, FindHao
Differential Revision: D80994791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161953
Approved by: https://github.com/FindHao
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
num_constants - num_folded_constants);
```
but the for loop does not consider it
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
even in the for loop, it does
```
bool from_folded = this->constant_from_folded(i);
if (from_folded) {
continue;
}
```
but `i` can still be wrong: `constants_internal_offset` is sized for the non-folded constants only, yet it is indexed with the raw loop index rather than a compacted one
Rollback Plan:
Differential Revision: D81425007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887
Approved by: https://github.com/angelayi
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.
Observed that the time spent in SymInt functions in computeStorageNBytes drops
(and does not cost-shift elsewhere in the function) after this change, by
profiling detach() using code similar to the benchmark from #160580
and Linux perf.
Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
Summary:
This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs.
This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
Test Plan:
```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996
Approved by: https://github.com/coconutruben
As the title stated.
As the document grows, its content keeps increasing, so to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop
Ideally these should just work and should provide backwards compat.
The thought process here is that many people rely on these commands, and just because setuptools wants to drop support for them doesn't mean our downstream users who build from source expect them to be gone.
This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000, https://github.com/atalman
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
- use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the suffix args
and epilogue function/fn for scaled_mm
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
- use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the prefix args
and epilogue function/fn for addmm/baddbmm
- expand kernelinputs to also be able to shuttle around non tensor
inputs (scalars) as is needed for alpha and beta
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm
```
Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the workspace_arg for
all the tma templates
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma
```
Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124
Approved by: https://github.com/jansel
ghstack dependencies: #161123
# why
- some kwargs are choice independent but rather
always the same for a specific op or template
- this enables us to track those differently than the
choice ones, and thus enables interception of them
cleaner
- maybe_append_choices can then be simplified to
just pass through the kwargs
# what
- hookup for template heuristics to have per template/op extra
kwargs that are always the same, for all choices
- hookup for the caller of get_mm_configs to provide template/op
kwargs to override some of the template/choice kwargs
this pr does not use the new machinery, and everything is empty
for now. subsequent prs start using it to simplify ops
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123
Approved by: https://github.com/jansel
The functions here expect to return pointers, but currently aren't returning anything. Make them return NULL.
The properties array wants an extra set of braces. One pair for the array, another for the first item in the array.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792
Approved by: https://github.com/Skylion007
As titled: make the `getmem` calls in the loop non-blocking so that we max out the issuance rate.
Also add a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006
Approved by: https://github.com/ngimel
Previously, without calling `torch.empty` before NVSHMEM init, we saw the error below:
```
src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting
src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
```
Fix it by calling `cudaFree(nullptr)` to make sure the CUDA runtime is initialized before NVSHMEM init.
Remove all `torch.empty(1)` calls from the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232
Approved by: https://github.com/ngimel
ghstack dependencies: #161214
We previously assumed aot precompile should only work on non-closures. This is hard to enforce in practice because we see a lot of cases with decorators (e.g. Hugging Face models):
```
def check_inputs(fn):
def _fn(*args, **kwargs):
for arg in args:
assert arg.shape[0] > 1
return fn(*args, **kwargs)
return _fn
@check_inputs
def foo(x, y):
a = x + x
b = y + y
c = a + b
return c
```
It doesn't make sense not to support these cases since they are straightforward to handle.
This PR adds the logic to handle closure and make sure they can be precompiled properly.
Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990
Approved by: https://github.com/angelayi
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812
Rocblas backend tests are not enabled yet, WIP.
Test command
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
We basically follow the same pattern we use for tensor arguments. The major downside is that we now have to traverse the entirety of the int list, etc., where previously we didn't have to. Benchmarks suggest a 2% regression for the relevant cases.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160256
Approved by: https://github.com/albanD
Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved.
- Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly
- Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private&unused
@albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and
`torch/autograd/_functions/utils.py`? Thanks!
## BC Breaking
- **Deprecated members in `torch.onnx.verification` are removed**
Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323
Approved by: https://github.com/titaiwangms, https://github.com/angelayi
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error.
This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). An environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` is also added to control it.
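A hedged sketch of how a user might opt in (the exact config location is assumed from the names given above):
```python
# Hedged sketch: force cudagraph capture or fail loudly; the config path below
# is assumed from the flag name in the description.
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.cudagraph_or_error = True
# or equivalently, per the description: TORCHINDUCTOR_CUDAGRAPH_OR_ERROR=1

@torch.compile(mode="reduce-overhead")   # reduce-overhead enables cudagraphs
def step(x):
    return x.sin() + 1

step(torch.randn(8, device="cuda"))
```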
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang
Fixes #160053
The previous error message, `Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now`, was not clear.
Now we have:
```
python3
Python 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
... import torch.nn.functional as F
... a = torch.empty(2,2,2,2)
... F.pad(a, (1,1), mode="circular")
...
Traceback (most recent call last):
File "<python-input-0>", line 4, in <module>
F.pad(a, (1,1), mode="circular")
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad
return torch._C._nn.pad(input, pad, mode, value)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Padding size 2 is not supported for 4D input tensor.
Supported combinations for non-constant padding:
- 2D or 3D input: padding size = 2 (pads last dimension)
- 3D or 4D input: padding size = 4 (pads last 2 dimensions)
- 4D or 5D input: padding size = 6 (pads last 3 dimensions)
>>>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866
Approved by: https://github.com/mikaylagawarecki
Before this PR, `_select_strategy` always selected the first strategy with minimum redistribute cost. This caused unexpected behavior when
- multiple strategies have 0 redistribute costs
- the first one with 0 redistribute cost may perform local chunking
E.g. in memory efficient SDPA, the default orders of candidate strategies have a `Shard(2)` one before the `Replicate()` one. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_matrix_ops.py#L500-L512
When the input is `Replicate()`, `_select_strategy` will pick the `Shard(2)` strategy and do local chunking first, before local computation. This is clearly unexpected to users.
In this PR, we improve `_select_strategy` so that when multiple strategies have 0 redistribute cost, we prioritize the one which keeps input unchanged.
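A minimal sketch of the new tie-breaking rule (hypothetical plain-dict strategies; DTensor's real OpStrategy objects carry more structure than shown):
```python
# Minimal sketch with hypothetical dict-based strategies: among strategies with
# the minimum redistribute cost, prefer the one that keeps the input placement.
def select_strategy(strategies, current_placement):
    min_cost = min(s["redistribute_cost"] for s in strategies)
    cheapest = [s for s in strategies if s["redistribute_cost"] == min_cost]
    for s in cheapest:
        if s["input_placement"] == current_placement:
            return s
    return cheapest[0]

strategies = [
    {"input_placement": "Shard(2)", "redistribute_cost": 0.0},   # would chunk locally
    {"input_placement": "Replicate()", "redistribute_cost": 0.0},
]
assert select_strategy(strategies, "Replicate()")["input_placement"] == "Replicate()"
```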
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161882
Approved by: https://github.com/ezyang
Fixes #161375
The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely, the same syntax worked earlier on the page, as mentioned in the issue, but in this spot it was rendered as a broken link.
I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426
Approved by: https://github.com/soulitzer
The error message here implies that we can only call `self.persistent_load(...)` for ints or tuples, but due to the second part of the type check being inverted, weights-only unpickler will throw an exception iff `pid` is an int.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161661
Approved by: https://github.com/Skylion007
We have 4 different versions of inductor benchmark Docker images used in CI at the moment:
1. `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks` is used by almost all inductor jobs including nightly benchmark
2. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.12
3. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.13
4. `pytorch-linux-jammy-py3-gcc11-inductor-benchmarks` runs inductor unit tests on CPU
My proposal here is to clean up (2) and (3) and to keep (1) under the same setup from https://ghcr.io/pytorch/torchbench. Simplicity is the key here as inductor workflows are getting more and more complex:
1. Unit tests for Python variants like 3.12 and 3.13 were useful when they were first added to CI. They are much less useful now. [Flambeau](https://hud.pytorch.org/flambeau/s/3876ec7b-43f0-42c6-bfbf-899035e5bb77) shows a 0.97 correlation between them. And we are also moving to 3.14 nowadays. I want to choose 3.12 for (1), but will do this separately. This is also what TorchBench and vLLM are using on CI.
1. We are gradually cleaning up 3.9 on CI https://github.com/pytorch/pytorch/issues/161167
Another BE change here is to rename the jobs in various inductor workflows, because names like `linux-jammy-cuda12_8-py3_10-gcc9-inductor-build` are too long and confusing to look at; better to just use human-friendly names like `inductor-build`. Other information is already spelled out in the build environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161536
Approved by: https://github.com/zou3519
## Introduction
During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block's old lifetime is fully before future work, and safely reuse it within the same capture.
This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
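A hedged sketch of opting in (the key name comes from the flag above; whether it is exposed through `PYTORCH_CUDA_ALLOC_CONF` exactly like other caching-allocator options is an assumption):
```python
# Hedged sketch: the key mirrors the experimental flag named above; exposing it
# via PYTORCH_CUDA_ALLOC_CONF like other caching-allocator options is assumed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch

g = torch.cuda.CUDAGraph()
x = torch.randn(1 << 20, device="cuda")
with torch.cuda.graph(g):          # during capture, freed blocks may now be reused in-capture
    y = (x * 2.0).relu()
g.replay()
```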
## Terms
* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.
## When can we reuse a block during capture?
### Strong Rule (Graph-Wide Safety)
This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.
> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.
Why it's safe:
This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.
### Per-stream Rule (A Practical Optimization)
The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.
In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.
> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.
In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
## Implementation
* On `free(block)` during capture
* For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
* If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
* Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
* Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
* For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
* If yes, hand the block to S for immediate reuse within the same capture.
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
* Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.
## Examples (2 streams)
<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />
* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.
## Edge Case: Freeing after a join
Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).
In case 4, we have a missed opportunity: because the block's usage is not explicitly marked, we cannot determine that its actual last use may have occurred much earlier, long before the join. So we must wait for the subsequent join before the block can be reused.
## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes #160437
Summary:
This PR avoids compiling empty FX graphs generated during graph breaks. If there are no calls in the graph, we can just return the empty list of instructions.
More precisely:
In compile_and_call_fx_graph, if the FX graph contains no calls (count_calls(self.graph) == 0) and the return-value list is empty, we now return an empty instruction list immediately.
Impact:
module: dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160536
Approved by: https://github.com/Lucaskabela
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor`: returns a symmetric tensor given a peer rank.
The differences between the `get_buffer` API and the `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hook the SymmetricMemory allocator up to MemPool so that users can create symmetric tensors with regular `torch.zeros`, `torch.arange`, etc. factories, and so that our ops can have functional variants that create `out` tensors on symmetric memory.
For end users, this PR supports a Python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
**Summary**
This PR improves the performance of A16W8 GEMM template by
- Removing the config with block_n=48 & block_m=16 as it is not very efficient.
- Using AMX microkernel when M >= 5 so that we use AMX instead of AVX512 for M=5~31.
- Converting int8 values to bf16 with intrinsics instead of `at::vec::convert` as the latter does not have optimized implementation for this case.
We saw up to >10% performance gain in various cases of running Llama-3.1-8b-instruct.
**Test plan**
Already covered by UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161148
Approved by: https://github.com/CaoE, https://github.com/jansel
The way AsyncAllreduceCUDADeviceWork is currently implemented,
using it will force a copy of `shared_ptr<gloo::Context>`
because `std::move` does nothing for a const ref.
This PR changes the param type to shared_ptr<> instead of the
const ref. This allows more efficient parameter passing.
Here's an example that demonstrates the issue:
```cpp
#include <memory>
#include <iostream>
struct Foo {};
void useFoo_ref(const std::shared_ptr<Foo>& f) {
std::shared_ptr<Foo> internal = std::move(f);
std::cout << "use_count: " << internal.use_count() << '\n';
}
void useFoo_val(std::shared_ptr<Foo> f) {
std::shared_ptr<Foo> internal = std::move(f);
std::cout << "use_count: " << internal.use_count() << '\n';
}
int main() {
std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
useFoo_ref(std::move(f1)); // prints "use_count: 2"
std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
useFoo_val(std::move(f2)); // prints "use_count: 1"
}
```
This also aligns well with [C++ Core Guidelines][1] for handling
smart pointers.
[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
Following the signatures and semantics of `Stream` and `Event` in CUDA, we use CPU multithreading
and condition variables to implement equivalent capabilities as the underlying foundation of torch_openreg.
**Changes:**
- Add stream capabilities for OpenReg
- Add event capabilities for OpenReg
- Add kernel launch entrypoint for OpenReg
- Add testcases about stream and event for OpenReg
- Add example for OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160099
Approved by: https://github.com/albanD
ghstack dependencies: #161603
This will allow some more cases to use tensor descriptors. E.g. before, the following block params would not match
because the innermost dimension does not have stride 1:
```python
block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0])
```
After broadcasting dimensions and singleton dimensions are removed:
```python
block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602
Approved by: https://github.com/jansel
Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](8b58040ee3), which includes:
- Add vectorization path on maxpool forward channel last
- Add FlightRecorder support for ProcessGroupXCCL
- Fix random build failure on codegen
- Suppress dllexport warning on Windows
- Make torch-xpu-ops build depend on ATen XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152
Approved by: https://github.com/EikanWang
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
- Empty containers are Falsey
- Hoist cheap checks first
- Microbenchmarked single-element set access method
Benchmark code:
```
import timeit
to_test = [
('list(x)', 'x = set([3])'),
('x[0]', 'x = [3]'),
('list(x)[0]', 'x = set([3])'),
('next(iter(x))', 'x = set([3])'),
]
for (stmt, setup) in to_test:
res = timeit.timeit(stmt=stmt, setup=setup)
print(f"Time for `{stmt}`: {res}")
```
Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux)
```
Time for `list(x)`: 0.03418
Time for `x[0]`: 0.00852
Time for `list(x)[0]`: 0.03561
Time for `next(iter(x))`: 0.02278
```
FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304
This took me a bit to figure out and I'm pretty sure I've looked at
this code before. Pybind uses
`return_value_policy::reference_internal` for `def_property`, which
[causes the owning object to be kept alive for the lifespan of the
return
value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html),
allowing the getter to safely avoid copying the property
value. However, lambdas act like they return `auto`, not
`decltype(auto)`, so our lambdas themselves were forcing copies!
Testing: observed std::vector<Argument> copying disappear in Linux
perf profile of someOpInfo._schema.arguments/returns (in
_python_dispatch.correct_storage_aliasing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161301
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
If the number of mxn blocks cannot occupy all the threads, using a smaller register block size gives better performance, since the compute size per thread is smaller.
It may give ~20% performance improvement for the real case `m1_n512_k4096`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161144
Approved by: https://github.com/leslie-fang-intel
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592
TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367
Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```
Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.
Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) because it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`, so `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
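As a quick sanity check of the identity behind the old workaround, `exp(x) == exp2(x * log2(e))`, both paths agree to float32 precision:
```python
# Numerical check of the identity used by the old workaround:
# exp(x) == exp2(x * log2(e)), with log2(e) ~ 1.4426950408889634.
import math
import torch

x = torch.randn(1 << 16)
log2e = 1.4426950408889634
assert math.isclose(log2e, math.log2(math.e))
torch.testing.assert_close(torch.exp(x), torch.exp2(x * log2e), rtol=1e-5, atol=1e-6)
```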
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
Fixes #159247
Issue 1: Accuracy Problem with Non-Divisible KV Sequences
---------------------------------------------------------
### Background
Paged attention in flex decoding produced inaccurate results when KV sequence length is not divisible by block size. For example, when `KV_S = 64` and `block_size = 128`, the output didn't match standard attention accuracy.
### Root Cause
The current paged attention does not apply an upper mask mod when converting from the logical to the physical mask mod. Instead, it uses a noop_mask by default, which leaves all values unmasked and leads to an accuracy mismatch. Adding an upper mask mod based on the original actual kv_len (64 in this test case) resolves the issue.
### Solution
* **Applied proper upper bound masking**: Updated all calls to `convert_logical_block_mask` to pass `kv_len` as a tensor with proper shape `[B, KV_S]` to provide information of actual batched KV sequence length. The function now correctly applies upper bound checks using the actual KV sequence lengths for each batch
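A simplified sketch of the bound being applied (hypothetical wrapper; the real change lives inside `convert_logical_block_mask` / `get_mask_mod`):
```python
# Simplified sketch with a hypothetical wrapper: AND the converted mask_mod with
# an upper-bound check against each batch's actual KV length.
import torch

def add_upper_bound(mask_mod, kv_len):            # kv_len: int tensor of shape [B]
    def bounded_mask_mod(b, h, q_idx, kv_idx):
        return (kv_idx < kv_len[b]) & mask_mod(b, h, q_idx, kv_idx)
    return bounded_mask_mod

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

kv_len = torch.tensor([64])                       # actual KV_S, even if block_size is 128
bounded = add_upper_bound(causal, kv_len)
b, h = torch.tensor(0), torch.tensor(0)
print(bounded(b, h, torch.tensor(100), torch.tensor(80)))  # tensor(False): 80 >= kv_len[0]
```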
### Files Modified
* `torch/nn/attention/experimental/_paged_attention.py`: Added `kv_len` parameter as a tensor to `get_mask_mod` and applied upper mask to the new mask mod.
* `test/inductor/test_flex_attention.py`: Fixed all related `kv_len` parameter call in the tests
* `test/inductor/test_flex_decoding.py`: Fixed all related `kv_len` parameter call in the tests
Issue 2: Invalid Memory Access (IMA) in Triton Kernels
------------------------------------------------------
### Background
The Triton kernel for flex attention was experiencing invalid memory access errors when running with compute sanitizers, particularly with short KV sequences and small batch sizes.
### Root Cause
* Kernel launches CTAs (Cooperative Thread Arrays) proportional to GPU's multi-processor count (108 via `SPLIT_KV`)
* With small workloads, many CTAs remain idle but still attempt to access `kv_indices` with invalid `indices_idx` values
* This caused out-of-bounds memory access violations
### Solution
Implemented boundary checks with early exit:
1. **Added `MAX_VALID_KV_IDX` parameter** in `torch/_inductor/kernel/flex/flex_decoding.py`
* Calculate maximum valid KV index based on actual `kv_indices` tensor size and pass it to Triton template
2. **Added early exit logic** in `torch/_inductor/kernel/flex/templates/flex_decode.py.jinja`
* Boundary checks before accessing `kv_indices` in both normal and full blocks
* Idle CTAs with invalid `indices_idx` skip computation entirely
This prevents invalid memory access while reducing wasted computation on idle thread blocks.
Testing & Validation
--------------------
### Accuracy Tests
* Added comprehensive test cases covering KV sequences not divisible by block sizes
* Verified output matches standard attention for various sequence length combinations
### Sanitizer Results
`========= COMPUTE-SANITIZER Starting standalone test_max_autotune... Running test_max_autotune on device: cuda max_autotune config: True test_max_autotune completed successfully! Test passed! ========= ERROR SUMMARY: 0 errors`
**Before**: More than 13720 invalid memory access errors with sanitizers
**After**: Clean execution with 0 errors
Both fixes work together to ensure paged attention produces accurate results while running safely without memory access violations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160861
Approved by: https://github.com/BoyuanFeng
`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized.
This adds an option `use_batch` which will call the send/recv with `batch_isend_irecv` which will re-use the communicators already initialized for collectives in the group.
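A hedged usage sketch (the keyword name is taken from this description; assumes a NCCL process group spanning ranks 0 and 1 is already initialized):
```python
# Hedged sketch: pass use_batch=True so the object send/recv goes through
# batch_isend_irecv and reuses the group's existing communicator.
import torch.distributed as dist

rank = dist.get_rank()
objs = [{"step": 1, "loss": 0.25}] if rank == 0 else [None]

if rank == 0:
    dist.send_object_list(objs, dst=1, use_batch=True)
elif rank == 1:
    dist.recv_object_list(objs, src=0, use_batch=True)
    print(objs[0])
```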
---
BatchP2P ops create (or use an existing) communicator keyed by device index.
Regular P2P ops create (or use existing) dedicated 2-rank communicators keyed by “rank1:rank2”.
See:
c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342
Approved by: https://github.com/wconstab
Fixes https://github.com/pytorch/pytorch/issues/161510
Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: third_party/kineto (untracked content)
% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335).
Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions a configuration of looped reductions by setting RBLOCK == rnumel, so that we can still fall back to looped reductions as needed.
As criteria:
- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input)
- we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved).
Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055
Approved by: https://github.com/jansel
This heavily borrows implementation logic from `topk`.
As this method is non-deterministic, the logic for comparing indices against the CPU op was modified to just an equality statement, since by default the random numbers picked for the input tensor allow for quite a lot of overlaps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817
Approved by: https://github.com/dcci
We move AsyncTP tests to a separate test suite because 1) Async TP ops are not the core symmetric memory APIs, they are more like applications, and 2) MultiProcContinuousTest will skip all the following tests if a test fails (we should fix this too). We still want to get the test signals for the core
symmetric memory APIs when Async TP ops fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
[#RFC153024](https://github.com/pytorch/pytorch/issues/153024)
**Motivation**
1. Attention has been the critical performance bottleneck in current LLM models, and FlexAttention is a good choice for covering the broad attention variants in the transformers series of models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on the XPU device. Besides, it also provides a candidate for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
2. FlexAttention is a good starting point to push the Intel Triton-based GEMM kernels toward maturity. FlexAttention provides both a flexattention kernel and a flexdecoding kernel to cover compute-bound and memory-bound GEMM computation, and different shapes should also be supported to serve LLM inference, e.g. head_dim=64, 96, 128, 256.
**What does this PR do?**
1. Enable the device type for Flexattention kernel and UTs to ensure all important UTs pass on XPU device.
2. For E2E model inference, ensure that LLM model inference with FlexAttention is functional and ready.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553
Approved by: https://github.com/EikanWang, https://github.com/drisspg
Co-authored-by: Mao Yunfei <yunfei.mao@intel.com>
Co-authored-by: Xingyuan Li <xingyuan.li@intel.com>
Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: Xiao, Wang <wang.xiao@intel.com>
# why
- untested so far
# what
- add an empty config heuristic for all devices for decompose k
- the cuda heuristic, because it is more specific, will still be picked
up
- add notes explaining how to enable on other devices
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "decompose_k"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161795
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #161767
# why
- not having a heuristic is an error but should not crash, just provide 0 configs
- some heuristics are cross device type
- cleaner to be explicit about being cross device type than having to
enumerate every possible device type
# what
- on registration, supply device_type=None (explicitly) to say this
heuristic is cross device
- test to guard the heuristics hierarchies
# testing
```
python3 -bb -m pytest test/inductor/test_template_heuristics_registry.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161767
Approved by: https://github.com/PaulZhang12
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications: because we flatten all right-only and left-only dimensions, if right and left operand dimensions are interleaved in the output we cannot (with the current algorithm) produce a contiguous output. However, it works for common cases like the one in the linked issue. Let's see what CI says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.
https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
Fixes https://github.com/pytorch/pytorch/issues/161510
Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: third_party/kineto (untracked content)
% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
**Summary**
When `device_id` is not None, barrier() will choose the accelerator with the highest
priority, which means that if the test specifies CPU for testing while CUDA is
available on the host, barrier() will use CUDA. To avoid this and better respect
`self.device_type`, we add this branch to force barrier() to use CPU when
`self.device_type` is CPU and another accelerator is also available.
**Test**
`pytest test/distributed/tensor/test_dtensor_testbase.py`
**Debugging Output**
```
# from init_process_group()
init pg: backend=gloo, device_id = None
default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')]
# from barrier()
barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015
Approved by: https://github.com/tianyu-l
This PR refactors the XPU quantization ops to align their code structure with the CPU implementation for consistency. It also adds necessary header files to enable future integration with AOTI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157430
Approved by: https://github.com/angelayi
Summary:
Fixes #160749
For a model like
```
class M(torch.nn.Module):
    def forward(self, x):
        s = torch.sin(x)
        z = 1j * s
        return z
```
Its graph will be
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%x,), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%sin, 1j), kwargs = {})
    return (mul,)
```
`1j` will appear as a constant complex argument in the `aten.mul`
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_complex_constant
Rollback Plan:
Differential Revision: D80672323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161517
Approved by: https://github.com/angelayi
Summary: this will only trigger it in the event that we are serializing a triton HOP. There are a few tests that do weird mocking that this function doesn't like, so this will prevent it from being called there.
Test Plan:
att
Rollback Plan:
Differential Revision: D81261486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161737
Approved by: https://github.com/angelayi
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, and so it is always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.
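A hedged sketch of that last point, using the public custom-op and selective-checkpoint APIs (the op name `mylib::fused_scale` and its body are stand-ins for a real Triton kernel, not part of this PR):
```python
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Stand-in for a Triton kernel wrapped as a dispatchable custom op so SAC can see it.
@torch.library.custom_op("mylib::fused_scale", mutates_args=())
def fused_scale(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0

@fused_scale.register_fake
def _(x):
    return torch.empty_like(x)

# d(2x)/dx = 2; no saved values are needed for this toy backward.
fused_scale.register_autograd(
    lambda ctx, grad: grad * 2.0,
    setup_context=lambda ctx, inputs, output: None,
)

def policy(ctx, op, *args, **kwargs):
    # Save the custom op's output instead of recomputing it; recompute everything else.
    if op is torch.ops.mylib.fused_scale.default:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

def fn(x):
    return fused_scale(x).sin().sum()

x = torch.randn(8, requires_grad=True)
out = checkpoint(
    fn, x, use_reentrant=False,
    context_fn=lambda: create_selective_checkpoint_contexts(policy),
)
out.backward()
```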
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/xmfan
# why
- enable it to go through commont template heuristics point
- make easier to use in common extension point e.g. lookup table
# what
- break template heuristic into base + triton
- move k_split generation logic into a templateheuristic for decompose k
- register through normal mechanism
- to make testing work, add a context manager to temporarily set
template heuristics for a template/op to empty (effectively skipping
it). This is used for decompose k test to disable triton choices
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670918](https://our.internmc.facebook.com/intern/diff/D80670918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161098
Approved by: https://github.com/jansel
ghstack dependencies: #161026, #161097
# why
- make it easier to integrate into lookup table later
# what
- current version generates templates on the fly and uses them
to generate a single choice
- lookup table and performance model work best when there is a
stable set of templates (with predictable names) and those
are then parametrized
- this change makes it so that there is a single DecomposeK template
with a stable name, and the k split is the only parametrization we do
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v
```
Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
Summary:
Add fastResizeToZero whenever we are reusing output tensors. Otherwise it keeps throwing this warning:
```
Warning: An output with one or more elements was resized since it had shape [10], which does not match the required output shape [181]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
```
Test Plan:
Run local replayer.
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=786096203
SNAPSHOT_ID=11
HARDWARE_TYPE=1 ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 3443 2>&1 | tee ~/logs/${MODEL_TYPE}/predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 1000 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_ads_mtml_offsite_cvr_oba_optout_dedicated_model_100 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} false 3443 false 2>&1 | tee ~/logs/${MODEL_TYPE}/replayer_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
```
Before: P1921177565
After: P1921178087
Rollback Plan:
Differential Revision: D81177596
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161679
Approved by: https://github.com/henryoier
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D79903317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
This PR fixes:
- Numpy >= 2.1 version detection (instead of python 3.13 version detection) to skip some tests (numpy 2.1 can be installed for older python versions)
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:
```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).
Changes Included
- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
Fixes #147282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like
```python
import torch
print(torch.arange(-3., 3., step=.5, device='mps').round())
print(torch.arange(-3., 3., step=.5, device='mps').cpu().round())
```
Before this change it printed
```
tensor([-3., -3., -2., -2., -1., -1., 0., 1., 1., 2., 2., 3.], device='mps:0')
tensor([-3., -2., -2., -2., -1., -0., 0., 0., 1., 2., 2., 2.])
```
But after this change the results match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712
Approved by: https://github.com/dcci
Mainly, this helps tell the user more info about the operator that
failed to run if it fails during sharding propagation.
Previously, only this exception would be raised:
```
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
```
Now you get both the above exception as well as
```
The above exception was the direct cause of the following exception:
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
<stacktrace omitted>
<details><summary>detailed error</summary>
```
======================================================================
ERROR: test_linear (__main__.TestDTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper
self._join_processes(fn)
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes
self._check_return_codes(fn, elapsed_time)
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 4 exited with error code 10 and exception:
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch
self.sharding_propagator.propagate(op_info)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate
OutputSharding, self.propagate_op_sharding(op_info.schema)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
return self.cache(*args, **kwargs)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached
op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema)
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy
input_tgt_placements, output_placements = propagate_shape_and_sharding(
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding
in_dim = get_in_dim_to_shard(cmd)
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard
raise RuntimeError(
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test
getattr(self, test_name)()
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper
fn()
File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
method(*args, **kwargs)
File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper
raise e
File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper
func(self, *args, **kwargs) # type: ignore[misc]
File "/data/users/whc/pytorch/test.py", line 60, in test_linear
print("results: ", distributed_linear(distributed_input))
File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward
return F.linear(input, self.weight, self.bias)
File "/data/users/whc/pytorch/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
return fn(*args, **kwargs)
File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__
return DTensor._op_dispatcher.dispatch(
File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch
raise RuntimeError(
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
</details>
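The chaining shown above is just the standard `raise ... from` pattern; a minimal illustrative sketch (the `propagate`/`OpInfo` names are stand-ins, not the actual dispatch code):
```python
class OpInfo:
    schema = "Op(op=aten.view.default, ...)"   # illustrative placeholder

def propagate(op_info):
    raise RuntimeError("Attempted to flatten sharded dimension 1, "
                       "but only the leftmost dim of a Flatten can be sharded.")

op_info = OpInfo()
try:
    propagate(op_info)
except Exception as e:
    # "from e" preserves the original error as __cause__, which is what produces the
    # "The above exception was the direct cause of the following exception" output.
    raise RuntimeError(f"Sharding propagation failed for {op_info.schema}") from e
```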
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574
Approved by: https://github.com/zpcore, https://github.com/XilunWu
Reland of https://github.com/pytorch/pytorch/pull/159923
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.
2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict.
3. We modify yolov3 to fix the previously silently incorrect behaviour.
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking.
When upgrading the torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.
Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor`: returns a symmetric tensor given a peer rank.
The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```
This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device will "contaminate" the next test function.
Solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which would respawn processes every time.
Hang is gone now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, and so it is always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519
ghstack dependencies: #160781
The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet.
To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477
Approved by: https://github.com/jansel
If TorchDispatchMode.ignore_compile_internals() is True, then we turn
off the TorchDispatchMode during the compilation process, instead
turning it back on during runtime of the compiled artifact.
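A hedged sketch of how a mode might opt in (the `LoggingMode` body is illustrative; `ignore_compile_internals` is the hook named above, assumed here to be overridable on the subclass):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    # Assumed override point per the description above: returning True means the mode
    # is turned off while torch.compile compiles, then re-enabled when the compiled
    # artifact actually runs.
    @classmethod
    def ignore_compile_internals(cls):
        return True

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(f"dispatching {func}")
        return func(*args, **(kwargs or {}))

compiled = torch.compile(lambda x: x.sin() + 1)
with LoggingMode():
    compiled(torch.randn(4))   # runtime ops are logged; compilation internals are not
```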
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648
Approved by: https://github.com/bdhirsh
Adds the pre-dispatch handling for the AC hop. This lets the HOP pre-dispatch export without actually pre-dispatch tracing into it. However, this is not sufficient to support AC in export:
- because the HOP body will still be in torch IR, so it will fail export verifiers
- the exported module also can't be ran in eager because the AC HOP relies on partitioner to embed RNG state saving/restoring
So it must be lowered by AOT Autograd into post-dispatch first before being executed. It suffices for my purposes, though.
If users had checkpoint API use in their exported model, the behavior goes from silently incorrect to a validation error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161479
Approved by: https://github.com/ydwu4
ghstack dependencies: #161353
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.
However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).
The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:
> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.
But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is misleading.
I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
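A minimal, Linux-only re-sketch of that verification (not the linked gist verbatim): the worker thread restricts its own affinity, and the main thread's mask is unaffected.
```python
import os
import threading

main_before = os.sched_getaffinity(0)   # pid 0 == the calling thread

def worker():
    cpus = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpus[0]})   # restrict only this worker thread
    print("worker affinity:", os.sched_getaffinity(0))

t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread's mask is unchanged, so a parallel start can rely on each spawning
# thread setting affinity for the child it creates.
assert os.sched_getaffinity(0) == main_before
```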
# This PR
Remove restrictions against parallel start for NUMA binding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
Prints ranges of ranks succinctly.
e.g.
For a strided list of ranks, summarizes down to start:stop:step
```
0:4096:512
```
Omits step if it's 1
```
0:8
```
Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second it is
7.
Currently, this does not support combinations of striding _and_ ranges (e.g. it
cannot generate a representation like "0:2, 4:6, ..., 12:14"). Is this
needed / useful? If so it could be added.
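A minimal sketch of the summarization rule described above (illustrative, not the PyTorch implementation):
```python
def summarize_ranks(ranks: list[int]) -> str:
    """Collapse an arithmetic progression of ranks into "start:stop:step" (exclusive stop)."""
    if len(ranks) == 1:
        return str(ranks[0])
    step = ranks[1] - ranks[0]
    if any(b - a != step for a, b in zip(ranks, ranks[1:])):
        return ",".join(map(str, ranks))      # mixed striding/ranges are not summarized
    stop = ranks[-1] + step                   # exclusive endpoint, per the note above
    return f"{ranks[0]}:{stop}" if step == 1 else f"{ranks[0]}:{stop}:{step}"

print(summarize_ranks(list(range(0, 4096, 512))))  # 0:4096:512  (last rank is 3584)
print(summarize_ranks(list(range(8))))             # 0:8         (last rank is 7)
```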
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
ProfilingGraphExecutor works like this:
1. do some unrelated JIT optimizations
2. Add profiling nodes to collect JIT information like tensor dtypes and shapes
3. Do some more unrelated JIT optimizations
4. Remove the profiling nodes and extract the tensor info, and then use the JIT tensor info to do optimizations.
This PR is intended to fix a bug in Step 4, where the profiling nodes were removed. It was previously assumed that all the things that were profiled were either Tensors or Optional[Tensor]s - otherwise, step 2 would not have introduced a profiling node.
However, we saw a case where step 3 would replace Optional[Tensor] inputs with `None` inputs (e.g. if a conditional that returned a Tensor or a None could be statically known to only follow the `None` branch).
To fix this, we essentially just modify the RemoveProfileNodesAndSpecializeTypes assert so that it accepts Tensors, Optional[Tensor]s, or None (the new part).
Note that this issue is probably somewhat uncommon (maybe why we didn't see it for the first 4 years that this code existed). I expect that, typically, any time that step 3 would convert `Optional[Tensor] -> None`, step 1 would have already done that. So it's difficult to reproduce in an end-to-end TorchScript workload.
Differential Revision: [D81068172](https://our.internmc.facebook.com/intern/diff/D81068172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161538
Approved by: https://github.com/nmacchioni
Summary:
In preparation for checking shape guards in export, this PR effectively switches `prefer_deferred_runtime_asserts_over_guards` to `False`, matching Dynamo.
Actually that's a lie: we switch it to `allow_complex_guards_as_runtime_asserts`, which is `False` by default but can be controlled via an internal API to be `True`. This makes the two flags synchronized, so we should be able to kill `allow_complex_guards_as_runtime_asserts` at this point.
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D79734206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160111
Approved by: https://github.com/tugsbayasgalan
Fixes #161640
Check if tensors are contiguous before using the no-graph implementation. Using the script in the issue above with this change, I get the expected results.
```
MPS contiguous result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0')
MPS non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0')
CPU non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Summary: When performing constant folding, we must skip over operators that have symbolic `fill_value`.
Test Plan:
CI
Rollback Plan:
Reviewed By: kalpit-meta-1
Differential Revision: D80965936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161437
Approved by: https://github.com/StellarrZ
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.
Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.
Followup: we can skip guards on continuation functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.
Also simplify some codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817
This method never triggered. It's used in 2 tests and they pass, so no serious
concern.
Note that I did introduce and fix a latent bug: if we called
shutdown_compile_workers, jobs would crash with this change because ready_future
would already be finished when we called wait.
However, we only call wait in tests, so that bug is fine.
The other behaviour is that if you called shutdown, I believe we may
potentially block on your first triton compile after that, until the pool was
ready. This should correctly switch to direct mode until the pool is ready on
later warmups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161561
Approved by: https://github.com/masnesral
ghstack dependencies: #161452
Added an optional name argument to SubprocPool.submit.
We record this in a dictionary, and when raising exceptions, add the name.
We manage the lifecycle the same as the pending futures.
Added a specific testcase to make sure this logs correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161452
Approved by: https://github.com/masnesral
# Feature
2d launch grids with dynamic shapes can contain sympy expressions like `floor(x / 128 + y / 128)`. This breaks the dynamic shapes tracer which only supports `FloorDiv`, and not `floor`. To handle this case, call `sympy.together` prior to pattern matching to convert this to `floor((x + y) / 128)`. Then, we can recognize the pattern and map it to `FloorDiv(x + y, 128)`.
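A small standalone illustration of the normalization step (plain sympy, not the Inductor pattern matcher itself):
```python
import sympy

x, y = sympy.symbols("x y", integer=True, positive=True)

# Shape of the launch-grid expression inside floor(...) under dynamic shapes.
inner = x / 128 + y / 128

combined = sympy.together(inner)        # (x + y)/128
numer, denom = sympy.fraction(combined)
print(combined, numer, denom)           # (x + y)/128  x + y  128
# After this normalization, floor((x + y)/128) can be pattern-matched to FloorDiv(x + y, 128).
```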
# Test plan
Added a custom Triton test exposing this. The test calls a 2d autotuned kernel with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161582
Approved by: https://github.com/nandesuka
Adding a new feature to torch.compile(fullgraph=True) which AOT-compiles a function with given example inputs.
On the user side it should look like:
```
def foo(x, y):
    return x + y
compiled_fn = torch.compile(fullgraph=True).aot_compile(((torch.randn(3, 4), torch.randn(3, 4)), {}))
```
This is different from the traditional `torch.compile` workflow, where the compiled object will be a drop-in replacement for the original eager model:
```
tensor input -> torch.compile() -> tensor output (and populates the cache entry)
```
`aot_compile` will instead return a compiled function as the result; it's purely functional and doesn't populate the compile cache entry in dynamo:
```
tensor input -> aot_compile() -> compiled function
```
The aot compiled function will be savable and loadable on disk as well:
```
torch.compile(fullgraph=True).aot_compile(...).save_compiled_function('my/path')
compiled_fn = torch.compiler.load_compiled_function("my/path")
```
Right now we treat the compiler backend as a black box and it needs to implement the following interface to make compile artifacts serializable:
```
class SerializableCallable:
    def save_compile_artifacts(): ....
    def load_compile_artifacts(): ....
```
We haven't implemented this for inductor yet, but this shouldn't be an issue since we gate this feature through `torch._dynamo.config.aot_compile` (which defaults to False), and this will be left as a follow-up PR to the current PR.
Differential Revision: [D80914270](https://our.internmc.facebook.com/intern/diff/D80914270/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161383
Approved by: https://github.com/tugsbayasgalan
Including the changes below:
- Add XPU support package 2025.2 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.1 build in CI to ensure no break issue until PyTorch 2.9 release
- Upgrade XPU support package from 2025.1 to 2025.2 in CD for both Linux and Windows
- Rename Linux CI job name & image name to n & n-1
- Update XPU runtime pypi packages dependencies of CD wheels
- Remove deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158733
Approved by: https://github.com/EikanWang, https://github.com/atalman
Summary:
It's hard to understand how it's working in most of our models, but in general it looks like `aten::copy_` is replaced incorrectly.
There are two schemas for `aten::copy_`:
1. `aten::copy_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!)`
2. `aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)`
According to the logic in the comments we don't need one of the parameters for `aten::index_put_`.
It seems the logic was inferred from the ordinary `aten::copy`, where there can be a third parameter, the `non_blocking` flag.
Depending on the execution environment, the sliced copy can be replaced either by the first schema or by the second schema with the default parameter explicitly set to `False`.
If the first schema is selected, it will lead to a crash (which is easy to catch in our prod env). If the second schema is selected, there is no crash, but the third parameter is treated as the `accumulate` parameter of the `index_put_` function, which doesn't make sense.
So, in any case, usage of the third parameter must be removed from the `aten::copy_` replacement.
For more details and check this post:
https://fb.workplace.com/groups/1405155842844877/permalink/25337687649165028/
Test Plan:
The test fails in the production environment only.
In the test env, the `non_blocking` flag is mapped as `False` to the `accumulate` flag, which doesn't cause the test to fail but makes no sense in terms of flag mapping.
The export works without errors; before the fix it was failing with an out-of-bounds vector index access, like this:
```
1095 _C._jit_onnx_log("Torch IR graph at exception: ", graph)
File ~/.bento/kernels/bento_kernel_gaia_ml/1578/bento_kernel_gaia_ml_binary-inplace#link-tree/torch/onnx/utils.py:636, in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module)
629 _C._jit_pass_lower_all_tuples(graph)
630 # in _jit_pass_onnx, symbolic functions are called for each node for conversion.
631 # However, there are nodes that cannot be converted without additional context.
632 # For example, the number of outputs from split (and whether it is static or dynamic) is unknown
633 # until the point where it is unpacked by listUnpack node.
634 # This pass does a preprocess, and prepares the nodes such that enough context can be received
635 # by the symbolic function.
--> 636 _C._jit_pass_onnx_remove_inplace_ops_for_onnx(graph, module)
637 _C._jit_pass_onnx_preprocess(graph)
639 # onnx does not support tuples, so try to remove them
RuntimeError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
```
The test script:
```
import torch as th
import tempfile
class CopyTest(th.nn.Module):
    def forward(
        self,
        input_th: th.Tensor
    ):
        to_fill = th.ones((3, 3))
        to_fill[:, 0] = input_th[:, 0]
        return to_fill
m = CopyTest()
test_tensor = th.zeros((3, 3))
with tempfile.NamedTemporaryFile() as f:
    th.onnx.export(
        m,
        (test_tensor,),
        f,
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["features"],
        dynamo=False,
    )
```
The exported model test:
```
import torch
import onnx
import onnxruntime
model_name = '/home/ironsided/test_model.onnx'
onnx_model = onnx.load(model_name)
onnx.checker.check_model(onnx_model)
example_inputs = (torch.zeros(3, 3),)
onnx_inputs = [tensor.numpy(force=True) for tensor in example_inputs]
print(f"Input length: {len(onnx_inputs)}")
print(f"Sample input: {onnx_inputs}")
ort_session = onnxruntime.InferenceSession(
    model_name, providers=["CPUExecutionProvider"]
)
onnxruntime_input = {input_arg.name: input_value for input_arg, input_value in zip(ort_session.get_inputs(), onnx_inputs)}
# ONNX Runtime returns a list of outputs
onnxruntime_outputs = ort_session.run(None, onnxruntime_input)[0]
print(onnxruntime_outputs)
```
The produced result is correct:
```
Input length: 1
Sample input: [array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)]
[[0. 1. 1.]
[0. 1. 1.]
[0. 1. 1.]]
```
Rollback Plan:
Differential Revision: D80797028
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161263
Approved by: https://github.com/justinchuby, https://github.com/jermenkoo
Summary:
- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
Testing:
- Add inline test to verify structure and output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).
Changes Included
- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
Fixes #147282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
To facilitate the integration of a new backend, we plan to publish a new development note that details all the key components, hoping to speed up the development of other accelerators.
This PR is the beginning of this note and covers the registration of operators; we will gradually improve it and keep it in sync with OpenReg's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644
Approved by: https://github.com/albanD
Summary:
Use debug handle on kernel names to distinguish different calls to the same kernel.
Previous kernel name: kernel_name
New kernel name: kernel_name:debug_handle
We add the debug handle to the tlparse artifacts: `inductor_provenance_tracking_node_mappings` and `inductor_provenance_tracking_kernel_stack_traces`.
We also add debug handles in the comments of the generated code so we can map to them in the provenance tracking highlighter tool: https://github.com/pytorch/tlparse/pull/134
Example output code is below. If a kernel doesn't have a debug handle, the `[Provenance debug handles]` comment line will not be written.
```
# Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.addmm, aten.gelu]
# [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3
stream0 = get_raw_stream(0)
triton_poi_fused_addmm_gelu_2.run(buf4, primals_5, 300, stream=stream0)
```
The debug handles will also be used by downstream profilers such as zoomer.
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D78994959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161110
Approved by: https://github.com/angelayi
For kernels that need peer pointers directly, the rendezvous handle should allow the user to get the offset of the tensor w.r.t. the base allocation address. Thus the need to add an `offset` field to the SymmMem handle.
But we don't want to cache all the handles just because they have different offsets, hence the search and cache logic below:
(i) At rendezvous, the search key is still `x.storage().data_ptr()`, like now, but it should do search in 2 parts - one is just dictionary lookup, like today, if that failed, it needs to search `allocations_` to see if the storage ptr falls in one of the segments. This is possible as we have all segments recorded during alloc.
(ii) If this segment hasn't been rendezvoused, we rendezvous it, cache it in the `symm_mem_` map with its base address as key.
(iii) We still need to return a handle for the current tensor, with a corresponding offset. This handle will be a shallow copy of the base handle, with the offset adjusted.
Some impl details:
(i.1) If we find a matching allocation, we can immediately use the allocation base address to do a re-search in `symm_mem_`.
(iii.1) To make the handle copy shallow, we move the common information -- base ptrs, base signal pad, etc -- to a structure referenced by both handles. The structure is called `NVSHMEMPeerAllocInfo`. A copy of handle just adds one more `intrusive_ptr` to it. The handle copy constructor accepts an `offset` argument.
Test:
Existing tests should not fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161470
Approved by: https://github.com/ngimel
# Feature
Add support for custom Triton kernels to the FX backend. This turned out not to require any new features, except for a minor change to handle `tl.constexpr` arguments which are not part of the autotuning config.
# Caveat
This may not cover every possible case. For example, we might need more features for autotuning custom Triton code. This PR entirely skips the [custom codegen ](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L1034-L1039) for user-defined grid functions, but there may be edge cases requiring this logic. However, this PR seems to do a reasonable job as many of the grids end up being written into Inductor/Triton metadata and don't require special codegen.
As a follow up, I'm planning to test this against all of AOTI's custom Triton kernel tests.
# Test plan
Added a CI test using a custom Triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161474
Approved by: https://github.com/angelayi
We already have a context manager "set_checkpoint_early_stop". This PR adds a kwarg that toggles the same setting.
It is also useful to have a kwarg version of the setting in addition to the context manager because is annoying to apply a context manager when the AC is being applied via CheckpointWrapper.
Similar to the "debug" kwarg and the corresponding "set_checkpoint_debug_enabled" context manager, the context manager defaults to None and overrides the local setting when non-None.
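A hedged usage sketch (assuming the kwarg is named `early_stop`, mirroring the existing `set_checkpoint_early_stop` context manager):
```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_early_stop

x = torch.randn(4, requires_grad=True)

# Existing spelling: a context manager around the checkpointed call.
with set_checkpoint_early_stop(False):
    out = checkpoint(torch.sin, x, use_reentrant=False)

# New spelling assumed by this sketch: a per-call kwarg toggling the same setting.
# As with "debug", the context manager (when non-None) overrides the local kwarg.
out = checkpoint(torch.sin, x, use_reentrant=False, early_stop=False)
```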
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160781
Approved by: https://github.com/tianyu-l
If the onnx exporter falls back to draft_export with big models, this takes forever for users and possibly spams the printout, which keeps users from getting to their stack trace with strict=False.
We could consider making another API for draft_export as a debugging tool, or combining it with report=True when the "model is small"?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161454
Approved by: https://github.com/justinchuby
This change removes the need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics + return. This is crucial for perf on architectures with split caches (e.g. MI300), where fences are inherently costly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161180
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
convert_frame.compile_frame used to take a callback transform function which would capture the frame object it has, but the frame information was not passed directly into the compile_frame function.
This PR changes the signature of compile_frame so that the frame information is passed directly into the function without taking a callback. This makes it easier to build a fullgraph capture API on top of compile_frame.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81041296
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161514
Approved by: https://github.com/tugsbayasgalan
```python
import torch
torch._dynamo.config.capture_scalar_outputs = True
class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)
m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64, device="cuda"), torch.randn(3, 3, device="cuda"))
print(out)
```
Before the PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked_symint for the storage_offset of the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that's not bound to a proxy.
This PR tracks the symbols of storage_offset and associates them with a proxy using torch.ops.aten.storage_offset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199
Approved by: https://github.com/zou3519
ghstack dependencies: #161198
Before the change in this PR, we have an error for the following code
```python
import torch
torch._dynamo.config.capture_scalar_outputs = True
class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)
m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64), torch.randn(3, 3))
```
The error is caused when speculating fn: we try to lift the symbol of x0.storage_offset() but find that the symbol doesn't have a source associated with it.
What really happens is that, when the input tensor is a scalar tensor of int type and resides on CPU, we have a shortcut that creates a normal symint when .item() is called, see https://github.com/pytorch/pytorch/pull/126245.
However, previously we only tracked the unbacked symint output of an operation, because we believed all backed symints must have a source associated with them and have already been lifted as inputs at the top level. Now this invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (because only inputs and symbols derived from inputs have sources, and the result of .item() doesn't have a source).
In this PR, we start to also track the normal symint with the proxy that created it (i.e. in this case the proxy for .item()).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198
Approved by: https://github.com/zou3519
A performance optimization. Using `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.
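A generic illustration of the fusion (not the Muon update itself): `torch.addmm(M, A, B, beta=b, alpha=a)` computes `b*M + a*(A @ B)` in a single op.
```python
import torch

M = torch.randn(32, 32)
A = torch.randn(32, 64)
B = torch.randn(64, 32)
beta, alpha = 0.95, -1e-3   # arbitrary scale/add coefficients for the illustration

ref = beta * M + alpha * (A @ B)                      # matmul, scale, add as separate ops
fused = torch.addmm(M, A, B, beta=beta, alpha=alpha)  # single fused op
torch.testing.assert_close(ref, fused)
```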
**Benchmark**
In a QWEN-like 0.5B model training we observed average `optimizer.step()` latency speedup: matmul ~44.5 ms -> addmm ~27.4 ms: a **1.62×** speedup.
matmul
<img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />
addmm
<img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />
**Testing**
End-to-end training:
We used a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curves show consistency between normal matmul and addmm.
<img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />
Unit test:
```python
# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()
lr = 1e-3
wd = 0.1
momentum = 0.95
opt_ref_muon = Muon(
    params=model0.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
)
opt_exp_muon = Muon(
    params=model1.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
    use_addmm=True,
)
out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()
out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()
for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
shows numeric difference, but this is expected on bf16 precision:
```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```
~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~
Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, so it is more numerically accurate than the original form. Based on this, we opt to make `addmm` the default and only option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379
Approved by: https://github.com/janeyx99
Summary: Since Inductor skips JIT compilation for Triton kernels, we need to manually invoke `knobs.runtime.jit_post_compile_hook` if one exists. Here, we do this to enable Tritonparse to extract launch metadata from Inductor launched kernels. We can control whether or not Inductor will run the hook with a new `TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK=1 ` config variable.
Reviewed By: davidberard98
Differential Revision: D80624932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161443
Approved by: https://github.com/FindHao
Summary:
We split the refactoring in two parts for forward compatibility concerns
First, we land the deserialization (loading part)
Then, we land the serialization (saving part)
Save weights and constants as individual files in PT2 archive. Each weight/constant will be saved as raw bytes, unless it is a custom object (TorchBind object) or a non-fake tensor subclass, for these two special cases we still save them using pickle.
The metadata of saved tensors along with the file name will be saved as `PayloadMeta`.
The mapping from FQN to `PayloadMeta` will be saved as `PayloadConfig` under `WEIGHTS_CONFIG_FORMAT` and `CONTANTS_CONFIG_FORMAT`
This changes the serialization on the Python side when calling `torch.export.save()`.
For deserialization in python `torch.export.load()`, we make it BC-safe by allowing loading legacy format weights/constants.
For deserialization in C++ `torch/nativert/ModelRunner.cpp`, we make this a BC breaking change as currently the OSS ModelRunner API is not being used.
The file structure
```
├── archive_format
├── archive_version
├── byteorder
├── .data
│ ├── serialization_id
│ └── version
├── data
│ ├── sample_inputs
│ │ └── model.pt
│ ├── constants
│ │ ├── tensor_0
│ │ ├── tensor_1
│ │ └── model_constants_config.json
│ └── weights
│ ├── weight_0
│ ├── weight_1
│ ├── weight_2
│ ├── weight_3
│ └── model_weights_config.json
└── models
└── model.json
```
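For reference, a minimal save/load round trip using the public API this affects (the model here is just a placeholder):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(M(), (torch.randn(2, 4),))
torch.export.save(ep, "model.pt2")    # weights/constants are written as individual archive entries
ep2 = torch.export.load("model.pt2")  # loading stays BC-safe with the legacy format
print(ep2.module()(torch.randn(2, 4)).shape)
```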
Test Plan:
CI
Rollback Plan:
Differential Revision: D80035490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160394
Approved by: https://github.com/SherlockNoMad
Summary: In `tuned_scaled_mm()`, we unsqueeze any scalar scale from [] -> [1, 1]. Later, when we are determining how to set the `SCALING_ROWWISE` kernel attribute, we check whether the scale has 2 dimensions. However, since we previously unsqueezed any scalar scales, this will always evaluate to True.
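A tiny illustration of the logic bug (the names here are just for the sketch, not the Inductor internals):
```python
import torch

scale = torch.tensor(0.5)                 # tensorwise (scalar) scale
unsqueezed = scale.reshape(1, 1)          # the [] -> [1, 1] unsqueeze described above
scaling_rowwise = unsqueezed.dim() == 2   # always True, so it cannot distinguish scaling modes
print(scaling_rowwise)
# The dimensionality (or numel) has to be inspected before the reshape to tell
# tensorwise scaling apart from rowwise scaling.
```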
Test Plan:
Run the following tests in test/inductor/test_fp8.py:
test_tensorwise_scaling_tma_template
test_rowwise_scaling_tma_template
Rollback Plan:
Differential Revision: D80108117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160450
Approved by: https://github.com/eellison
Summary:
This diff removes configs that require more shared memory than the hardware limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
Test Plan:
```
buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr
```
Rollback Plan:
Differential Revision: D80594562
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040
Approved by: https://github.com/eellison
This is far simpler than #155164 since we never destroy the cudaGraphExec_t.
The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear.
This can work if your attention kernel can work for arbitrary shapes (this is not the case for all attention implementations! Many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SM's (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work.
Note that this requires knowledge of the exact location of the value in your kernel's parameter buffer to mutate. It won't work with arbitrary stream capture code whose kernels you don't know beforehand. So I expect this code path to be rarely used.
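For background, a minimal sketch of the standard capture/replay pattern this builds on (assuming a CUDA device; the new API additionally lets power users mutate parameters of the instantiated graph, which is not shown here):
```python
import torch

if torch.cuda.is_available():
    static_in = torch.randn(8, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = static_in * 2        # kernels and their arguments are baked in at capture time
    static_in.copy_(torch.arange(8.0, device="cuda"))
    g.replay()                            # reruns the captured kernels against the same buffers
    print(static_out)                     # reflects the new contents of static_in
```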
Testing:
```
pytest -s -k raw_graph_exec test/test_cuda.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy
Summary:
- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
Testing:
- Add inline test to verify structure and output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
## Problem
Fixing parameter mismatch issue during torch.export with strict mode (see "How to reproduce the issue" section below):
When there are two attributes mapping to the same tensor, strict mode will
1. Build a standard param buffer table to standardize the names (the bug happens [here](f861dc1826/torch/export/_trace.py (L356))! When 2 parameters have the same id(param), the latter name overwrites the previous one)
2. [Update](f861dc1826/torch/export/_trace.py (L1481)) the exported signature with the updated standard FQN (problematic)
3. When getting exported_program.module(), call [_unlift_exported_program_lifted_states](f861dc1826/torch/export/exported_program.py (L1297)) to recover attributes from the exported signature, where the parameter names are defined and standardized
As a result, the named_parameters of this module will have the overwritten name instead of the original name.
## How to reproduce the issue?
To reproduce the issue shared by @taotaohuang001 (torch version: 2.8.0):
```python
import torch
from torch import nn

# ---- Toy model with embedding weight sharing (aliasing) ----
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layers = nn.ModuleDict()
        tbl = nn.Embedding(100, 8)
        self.embedding_layers["ActorId"] = tbl
        # Alias: reuse the SAME module instance for another feature
        self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"]
        self.proj = nn.Linear(16, 1)

    def forward(self, feats: dict[str, torch.Tensor]):
        e1 = self.embedding_layers["ActorId"](feats["ActorId"])
        e2 = self.embedding_layers["RootActorId"](feats["RootActorId"])
        return self.proj(torch.cat([e1, e2], dim=-1))

torch.manual_seed(0)
m = Toy().eval()

# Show pre-export parameter names (canonicalized; shared weight appears once)
print("PRE-EXPORT named_parameters:")
print([name for name, _ in m.named_parameters()])

# Sanity: the two feature names point to the same weight object
w1 = m.embedding_layers["ActorId"].weight
w2 = m.embedding_layers["RootActorId"].weight
print("PRE-EXPORT alias -> same object:", w1 is w2, "| same storage:", w1.data_ptr() == w2.data_ptr())

# Example inputs (dict structure will be captured by export)
ex_in = {
    "ActorId": torch.randint(0, 100, (4,)),
    "RootActorId": torch.randint(0, 100, (4,)),
}

# ---- Export (in memory) and materialize the runnable module ----
ep = torch.export.export(m, (ex_in,), strict=True)
gm = ep.module()  # GraphModule with new (canonical) parameter names

print("\nPOST-EXPORT named_parameters (GraphModule):")
post_names = [name for name, _ in gm.named_parameters()]
print(post_names)

# Prove alias persists after export: run fwd/bwd and check a single grad tensor exists
out = gm(ex_in).sum()
out.backward()

# Find the embedding weight in the exported module by shape (100, 8)
emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])]
print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>")

# Show that only one grad exists for the shared table
for name, p in gm.named_parameters():
    if p.grad is not None and p.shape == torch.Size([100, 8]):
        print("Grad present on shared embedding weight:", name, "| grad shape:", tuple(p.grad.shape))
        break
```
And you will see that the parameter names differ before and after export:
```
PRE-EXPORT named_parameters:
['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias']
PRE-EXPORT alias -> same object: True | same storage: True
POST-EXPORT named_parameters (GraphModule):
['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias']
Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight
Grad present on shared embedding weight: embedding_layers.RootActorId.weight | grad shape: (100, 8)
```
## Solution
We fix this issue by making sure a later named parameter does not overwrite the `param_buffer_table` entry when the original model's named parameter already maps to a certain parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160600
Approved by: https://github.com/angelayi
The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from #160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161234
Approved by: https://github.com/wconstab
ghstack dependencies: #161231
Updates the inductor-wrapper-fxir code to use the kernel.op_overload when generating extern kernel calls. This way we can keep the IR consistent with using ATen ops.
TODO: we're also inserting torch.empty_strided calls -- need to turn this into aten too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161195
Approved by: https://github.com/blaine-rister
Summary: This change updates `getattr_recursive` to handle qualnames with ModuleList that contain digit indices, for example, `op_instances.1.value_model.feature_weights`
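A minimal sketch of the idea (a hypothetical helper, not the exact implementation being changed): treat purely numeric path components as indices into `ModuleList`-style containers rather than attribute names.
```python
from torch import nn

def getattr_recursive_sketch(obj, qualname: str):
    # Walk a dotted qualname, indexing when a component is a digit (e.g. "op_instances.1.weight").
    for part in qualname.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj

m = nn.Module()
m.op_instances = nn.ModuleList([nn.Linear(2, 2), nn.Linear(2, 2)])
print(getattr_recursive_sketch(m, "op_instances.1.weight").shape)
```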
Test Plan:
TBA
Rollback Plan:
Reviewed By: jiayisuse
Differential Revision: D80503985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161204
Approved by: https://github.com/jiayisuse
In this PR we will port all distributed pipeline test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:
1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get the backend
5. enable XPU for some test paths
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/kwen2501
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.
After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).
The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator. The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.
Fixes #159991
confirmed docs rendered OK
<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.
This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.
**Usage**
```python
model = MyModelForCausalLM
# filter out your params manually
muon_params = [...]
adamw_params = [...]
muon = Muon(
    params=muon_params,
    lr=lr,
    weight_decay=wd,
)
adamw = AdamW(
    params=adamw_params,
    lr=lr,
    weight_decay=wd,
)
# in training loop
loss = model(input)
loss.backward()
muon.step()
adamw.step()
muon.zero_grad()
adamw.zero_grad()
```
~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~
```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```
As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.
By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.
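For reference, a standalone sketch of that 5-step Newton-Schulz orthogonalization with the coefficients from Keller's post (an illustrative reimplementation, not the code added in this PR):
```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating the matrix sign / orthogonalization of G.
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients published in Keller Jordan's blogpost
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # bound the spectral norm by the Frobenius norm
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

# Example: the result is roughly orthogonal (rows have near-unit norm for a wide matrix).
g = torch.randn(16, 32)
o = newton_schulz5(g)
print((o @ o.T).diag()[:4])
```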
**Testing**
~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~
As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test.
2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />
**Numerics**
We evaluate our implementation with existing implementation to confirm numerical consistency.
As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.
As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below.
```python
# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()
lr = 1e-3
wd = 0.1
momentum = 0.95
opt_ref_muon = KellySingleDeviceMuon(
params=model0.parameters(),
lr=lr,
weight_decay=wd,
momentum=momentum,
)
opt_exp_muon = Muon(
params=model1.parameters(),
lr=lr,
weight_decay=wd,
momentum=momentum,
)
out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()
out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()
for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
As explained above, including this `adjust_lr` is preferable. This is validated by e2e training runs on a Qwen-2-like 0.5B model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />
**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:
- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min
Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.
AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />
Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />
**Note**
We restrict the implementation to accept only 2D parameters.
An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.
Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.
**Next Steps**
1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
This pull request adds the following ops for sparse matrices using Eigen library:
```python
add(a_csr, b_csr)
add(a_csc, b_csc)
addmm(c_csr, a_csr, b_csr)
addmm(c_csr, a_csr, b_csc)
addmm(c_csr, a_csc, b_csc)
addmm(c_csr, a_csc, b_csr)
addmm(c_csc, a_csr, b_csr)
addmm(c_csc, a_csr, b_csc)
addmm(c_csc, a_csc, b_csc)
addmm(c_csc, a_csc, b_csr)
```
Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL does not exist on `aarch64`, these ops are unavailable on any machines with ARM-based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.
This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.
Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.
# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)
Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`, and it's pretty simple.
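A minimal sketch of that set-affinity / spawn / restore pattern (assuming Linux, where `os.sched_setaffinity` is available; the names are illustrative, not the actual torchelastic code):
```python
import contextlib
import multiprocessing as mp
import os

@contextlib.contextmanager
def temporary_affinity(cpus):
    # Pin the parent to `cpus` so the child it spawns inherits that mask,
    # then restore the parent's original affinity.
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)

def worker(local_rank):
    print(f"local rank {local_rank} bound to CPUs {sorted(os.sched_getaffinity(0))}")

if __name__ == "__main__":
    all_cpus = sorted(os.sched_getaffinity(0))
    for local_rank, cpu in enumerate(all_cpus[:2]):
        with temporary_affinity({cpu}):
            p = mp.Process(target=worker, args=(local_rank,))
            p.start()
        p.join()
```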
This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
Summary: AMD-specific kwargs need to be removed from the guard, otherwise a KeyError will be raised when executing the kernel.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.
Rollback Plan:
Differential Revision: D80285441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
Summary:
[The diff was reverted due to CLA error, in the process of retrieving account]
Previous error message
```
RuntimeError: Expected input at *args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
New error message
```
RuntimeError: Expected input at *args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
Test Plan:
```
buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test
```
https://www.internalfb.com/intern/testinfra/testrun/4785074906254375
```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```
```
Ran 413 tests in 208.414s
OK (skipped=1, expected failures=13)
```
Rollback Plan:
Differential Revision: D80487367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160919
Approved by: https://github.com/angelayi
Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (#157677)
We traced the root cause to the registration of subgraph buffers: this caused inconsistencies in the naming and ultimately incorrect retrieval later on. This problem only arose if the model was compiled as a whole (i.e. using @torch.compile), since only then would there be naming conflicts.
In this PR, we register the buffers with the base graph to solve this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161170
Approved by: https://github.com/drisspg
Fixes #160743
The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (0ee6e58643/avg_pool_mps/perf_2d.py) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match.
I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested.
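A small repro-style sketch of the `ceil_mode=True` comparison (assuming an MPS-capable machine; with the fix, the MPS result should match the CPU reference):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 7)
ref = F.avg_pool2d(x, kernel_size=3, stride=2, ceil_mode=True)   # CPU reference
if torch.backends.mps.is_available():
    out = F.avg_pool2d(x.to("mps"), kernel_size=3, stride=2, ceil_mode=True)
    torch.testing.assert_close(ref, out.cpu())
```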
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011
Approved by: https://github.com/malfet
# Problem
The FX converter previously supported graph outputs which were `StorageBox`, but not `TensorBox`. The latter seems to show up in certain cases when the output is a slice/view of the input.
# Fix
This PR generalizes the code to handle `MutableBox` instead of `StorageBox` specifically.
# Test
Added a CI test exposing the issue. The test case was found by intentionally breaking `TensorBox(ReinterpretView` support in https://github.com/pytorch/pytorch/pull/161258.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161245
Approved by: https://github.com/angelayi
Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
Summary:
Adds Python garbage collection to Kineto traces and profiler FunctionEvents. We create a custom C++ callback in profiler_python.cpp, then define a Python function in C++ and register that callback for all Python garbage collection. We don't worry about thread safety in this case because we only do init/teardown for the main thread while holding the GIL.
Currently we hide this behind an experimental config because Python tracing tends to be unstable, especially when adding any new feature. If this is found not to add too much overhead, we can turn it on by default. NOTE: To enable this you need both with_stack=True and the experimental config on!
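A sketch of the kind of run this targets (the exact experimental-config knob is not named above, so it is omitted here; without it the GC events will simply not appear):
```python
import gc
import torch
from torch.profiler import ProfilerActivity, profile

# with_stack=True is required per the note above; the experimental config that turns
# the GC events on is assumed to be set elsewhere.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    x = torch.randn(512, 512)
    y = x @ x
    gc.collect()   # force a collection inside the profiled region
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```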
Test Plan:
Ran trace with GC induced and saw it on trace
Also added a test
Rollback Plan:
Differential Revision: D80491146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
For comparing NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter.
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes usability of hipify replacing c10::cuda::CUDACachingAllocator::get(), where callers expect a CUDAAllocator to be returned but were instead getting a very thin Allocator shim.
This also fixes using cudagraph trees with torch compile. The hip:0 device was not being replaced by the cuda:0 device in all methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
## Test Result
```bash
lintrunner --take MYPY test/test_numpy_interop.py
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`
However, it is not recognized as an allowed file format and a warning will be thrown.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80740916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
Fixes #158076
Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
&(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
&(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
&(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
static_cast<int64_t>(m_end + ((-1LL)*m_start)),
static_cast<int64_t>(Nr),
static_cast<int64_t>(k_end + ((-1LL)*k_start)),
static_cast<int64_t>(196LL),
static_cast<int64_t>(80LL),
static_cast<int64_t>(Nc_blocks*Nr)
);
```
However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.
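A small illustration of what a storage offset is and why adding it twice walks past the allocation (plain PyTorch, unrelated to the generated C++ above):
```python
import torch

base = torch.randn(4, 8)
w = base[2:]                              # a view that starts 2 rows into base's storage
print(w.storage_offset())                 # 16 elements
print((w.data_ptr() - base.data_ptr()) // base.element_size())   # also 16
# The generated kernel indexes from W's data_ptr(), which already includes the storage
# offset; emitting the offset again in the index expression doubles it and reads
# out of bounds.
```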
I think a good way to fix this issue is to create a new matrix that has no storage offset.
When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.
BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.
The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.
I ran CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
Note: Adding a unit test for this is tricky, as having errors in the specific unit test would cause test_utils.py to crash altogether.
Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n x = 1/0\n ~^~\nZeroDivisionError: division by zero\n']
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```
The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.
I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
Summary:
Removed `Model`; it's not being used anywhere, so it's safe to remove.
Removed the `tensor_paths` and `constant_paths` fields in `ExportedProgram`:
- BC: when the current deserializer loads a previously serialized EP (that comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields
- FC: when the old deserializer loads a newly serialized EP (that doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`
Differential Revision: D80725094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad
By changing the dtype to float if the device is MPS.
Note: for some reason the test runs much longer on MPS than on CPU
```
% python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps
test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok
----------------------------------------------------------------------
Ran 1 test in 9.139s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161201
Approved by: https://github.com/dcci
Set up the vLLM test logic:
1. install the wheels generated from the previous build stage
2. generate and install the vLLM test package list at run time, based on the torch wheels on the instance
3. run tests based on the pre-defined test plan
Note that the test-plan format is temporary, for some basic vLLM testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160361
Approved by: https://github.com/atalman, https://github.com/huydhn
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, then some ranks have no work to do. When I tested, this case wasn't caught, and there was an extra barrier call causing issues for the ranks that had no work to do. They should wait at the end, as the ranks with work do.
Test Plan:
tested this case on a job e2e
added a unit test
Rollback Plan:
Differential Revision: D80273616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
Summary:
The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons:
1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically.
2. Memory efficiency: It ensures that the vector allocates exactly the amount of memory needed, no more and no less, which is efficient when we know the final size in advance.
3. Reduced overhead: Each reallocation typically involves:
   - Allocating a new, larger block of memory
   - Copying all existing elements to the new location
   - Destroying the old elements
   - Deallocating the old memory block
4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs.
Test Plan:
OSS CI & tests
Rollback Plan:
Differential Revision: D80674453
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161128
Approved by: https://github.com/Skylion007
This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. This allows us to safely bypass when a model has unserializable or incompatible parts. When we encounter something incompatible, we'll raise a bypass and ignore that particular code in DynamoCodeEntry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160902
Approved by: https://github.com/zhxchen17
Expose the pointer so that we can create the `ncclConfig_t` object from PyTorch and use it elsewhere. This is useful for controlling the NCCL communicator parameters when there are multiple NCCL communicators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136
Approved by: https://github.com/kwen2501
**Summary**
When the output dtype is fp8, oneDNN does not ensure that intermediate results are in the range [-448, 448] before converting to fp8. So, we may get NaN in the output, which is a disaster for inference. This PR fixes the issue by clamping the intermediate results with oneDNN's post-op clip.
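A minimal sketch of the clamp-before-convert idea in plain PyTorch (assuming the e4m3 format; the actual fix uses oneDNN's clip post-op, not eager ops):
```python
import torch

fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448.0
intermediate = torch.randn(4, 4) * 1000           # intermediate results that can exceed the fp8 range
clamped = intermediate.clamp(-fp8_max, fp8_max)   # keep values representable
out = clamped.to(torch.float8_e4m3fn)             # convert only after clamping
print(out.float().abs().max())
```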
**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 16384
rnumel = r0_numel
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 1024000
rnumel = r0_numel
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
This is not needed on drivers >= 525, and in DriverAPI::get() we are initializing the context anyway, so setting the environment variable after that is beside the point.
As a result of calling DriverAPI::get on systems that don't have GPUs available (e.g. due to CUDA_VISIBLE_DEVICES=""), people were getting confusing errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161103
Approved by: https://github.com/eqy, https://github.com/malfet
The existing logic here to work around SFINAE issues under Microsoft platforms also applies to libc++ platforms. It appears that nvcc reports ambiguity in overload resolution for `pow_`. This seems like an nvcc limitation.
```
fbcode/caffe2/aten/src/ATen/native/cuda/Pow.cuh(42): error: more than one instance of overloaded function "pow" matches the argument list:
function template "std::__2::enable_if<<expression>, std::__2::__promote<_A1, _A2, void>>::type::type pow(_A1, _A2) noexcept" (declared at line 848 of fbcode/third-party-buck/platform010-libcxx/build/libcxx/include/c++/v1/math.h)
function template "std::__2::enable_if<<expression>, std::__2::__promote<_Tp, _Up, void>>::type pow(_Tp, _Up) noexcept" (declared at line 11308 of fbcode/third-party-buck/platform010/build/cuda/12.4/bin/..//include/crt/math_functions.h)
argument types are: (double, float)
return ::pow(base, exp);
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161101
Approved by: https://github.com/malfet
Summary: use `self.inductor_meta_common()` to call the static method, since custom subclasses may override the method as an instance method
Test Plan:
```
caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks
```
Rollback Plan:
Differential Revision: D80375351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895
Approved by: https://github.com/eellison, https://github.com/blaine-rister
Fixes #152952
Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did basic smoke test by running tests that `import torch.cpu` including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`.
Based this PR off of #152935 which is referenced in the main issue.
(also, this is my first contribution but I followed the contributing guide closely)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031
Approved by: https://github.com/janeyx99
Summary: Similar to #ifdef checks added in addmm_impl_cpu_ to conditionally enable ACL, we add the same checks in bmm_out_or_baddbmm_. This essentially disables ACL for bmm_out_or_baddbmm_ and enables ArmPL, which seems to be performing better.
Test Plan: AR SL
Rollback Plan:
Reviewed By: Nicoshev
Differential Revision: D80494623
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161065
Approved by: https://github.com/q10
Per title: previously we started throwing noisy warnings, but given how popular this pattern was in our test suite, we decided to leave it as a warning rather than a silent behavior change for one release.
Now `treatSequenceAsTuple` would return `true` only in the case where the sequence was indeed a tuple, so there is no need for a special function anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160794
Approved by: https://github.com/albanD
Differential Revision: D79694055
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and error out using the FQN of the lifted constant.
2. The previous attribute-mutation detection logic in non-strict didn't account for nested module structure. This fixes the silent incorrectness issue of exporting esm and qwen in non-strict.
3. We modify yolov3 to fix the previous silently incorrect behaviour.
When upgrading the torchbench pin, opacus_cifar10 no longer seems to run on eager. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159923
Approved by: https://github.com/avikchaudhuri
Use std::unique_ptr to decrease the size from 24 bytes to 8.
Since std::unique_ptr is not copyable, this required defining the copy / copy-assignment constructors, which made me realize we shouldn't be copying `tokens_` in those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160764
Approved by: https://github.com/albanD
Removing tb-nightly because we found issues when importing tensorboard: having both tb-nightly and tensorboard installed causes pip to report 2.18.0 (the pinned tensorboard) while importing in a Python shell reports 2.13.XXX. This mismatch causes issues when running tests in a numpy 2.x environment, e.g.
```
/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/redis/connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
warnings.warn(msg)
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
_EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
E
======================================================================
ERROR: test_event_handler (__main__.TestMonitorTensorboard.test_event_handler)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_monitor.py", line 116, in setUp
from tensorboard.backend.event_processing import (
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 25, in <module>
from tensorboard.backend.event_processing import (
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 25, in <module>
from tensorboard.backend.event_processing import event_file_loader
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/event_file_loader.py", line 21, in <module>
from tensorboard import dataclass_compat
File "/opt/venv/lib/python3.12/site-packages/tensorboard/dataclass_compat.py", line 33, in <module>
from tensorboard.plugins.hparams import metadata as hparams_metadata
File "/opt/venv/lib/python3.12/site-packages/tensorboard/plugins/hparams/metadata.py", line 32, in <module>
NULL_TENSOR = tensor_util.make_tensor_proto(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tensorboard/util/tensor_util.py", line 405, in make_tensor_proto
numpy_dtype = dtypes.as_dtype(nparray.dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py", line 677, in as_dtype
if type_value.type == np.string_ or type_value.type == np.unicode_:
^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/numpy/__init__.py", line 400, in __getattr__
raise AttributeError(
AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.
----------------------------------------------------------------------
Ran 1 test in 0.355s
FAILED (errors=1)
```
After removing tb-nightly and ensuring that tensorboard 2.18.0 is the only tensorboard in the env:
```
root@rocm-framework-47:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
.
----------------------------------------------------------------------
Ran 1 test in 0.409s
OK
```
```
>>> import tensorboard
>>> print(tensorboard.__version__)
2.13.0a20230426
```
```:/# pip show tensorboard
Name: tensorboard
Version: 2.18.0
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/venv/lib/python3.12/site-packages
Requires: absl-py, grpcio, markdown, numpy, packaging, protobuf, setuptools, six, tensorboard-data-server, werkzeug
Required-by:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160996
Approved by: https://github.com/huydhn
Summary: `includeBool` is already a small value type (i.e., `bool`, 1 byte) that's passed by value to the function. Capturing it by reference (4 or 8 bytes depending on the system) is unnecessary and could potentially lead to dangling-reference issues if the lambda outlives the original variable. Capturing by value is more efficient for small types and safer.
Test Plan:
OSS CI & tests
Rollback Plan:
Differential Revision: D80595698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161042
Approved by: https://github.com/Skylion007
- hf_Reformer: this one started failing due to increased graph breaks from the transformers pin bump (#159291). We can likely just bump the expected graph-break count.
- dla102: this one started timing out on Wed 8/13 between commits 6e8865f and ee1b041. But based on the PT2 dashboard, this model doesn't actually have a compile-time or runtime regression. Will try bumping up the timeout and see if it works.
- hf_BigBird: this one's accuracy status has improved as of today. Will update the hf_BigBird accuracy status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
Currently, the `std::min` -> `::min` mapping does not work as expected on ROCm when input values are >= 2147483648.
Replace `std::min` with a ternary statement.
Alternatively, `std::min` can be replaced by the explicit typing `std::min<int64_t>`.
fixes on ROCm:
test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16
error:
RuntimeError: Cannot sort dimension of length 8192
Similar PR to fix large tensors on ROCm https://github.com/pytorch/pytorch/pull/130994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161054
Approved by: https://github.com/jeffdaily
Summary: The ONNX team and the recent transformers upgrade ran into this error, and we also ran into it during our export benchmarking. This diff makes it possible to trace through the vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR, and in the future we should desugar them to post-grad ops.
The implementation strategy is:
1. We add python wrappers around vmap APIs so that we attach custom torch function handler that is only on during non-strict export. The reason is we don't want to add this to default torch_function handler because it will break BC.
2. Some dynamo changes to make sure it picks up the new Python wrapper APIs. The reason is that when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We could avoid this by special-casing in dynamo for export to proxy different API calls, but I feel that is too much chaos because you need to be able to proxy 2 different variants of the same vmap API.
Test Plan: CI
Differential Revision: D75623875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
# why
- head is broken
# what
- the template for experimental API is broken
- the test assumes not experimental API
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_persistent_tma_strided_a_transposed_True_b_transposed_False_dynamic_True -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161025
Approved by: https://github.com/PaulZhang12
Today convert_frame is implemented like the following:
```
def _compile():
    tracer_output = None

    def transform():
        nonlocal tracer_output
        ...

    def _compile_inner():
        transform(...)
        compile_inner(...)
```
The code uses an unconventional nonlocal variable as the return value. This is not ideal for 2 reasons:
1. Reasoning about the code, especially together with error handling code becomes harder.
2. more importantly, this makes it harder to extract out common code pieces into a shared library because everything must depend on a central global state.
In this diff we remove the usage of nonlocal return and just use the conventional function return to output the compilation data.
Differential Revision: [D80461258](https://our.internmc.facebook.com/intern/diff/D80461258/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160899
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815, #160855
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
This PR adds a new helper function compile_frame() which takes a bytecode and a transform function and return compiled bytecode + output graph as DynamoOutput type.
Differential Revision: [D80430802](https://our.internmc.facebook.com/intern/diff/D80430802/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160855
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815
This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail.
As an example, consider the following store instruction:
```
name=buf1
index=x2 + 192*y0 + 64*y1
value=TritonCSEVariable('tmp7')
params = BlockParameters(
shape=[3, 4, 1, 1, 64],
block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK],
strides=[64, 192, 0, 0, 1],
offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset]
)
broadcasting_dims=[False, False, True, True, False]
broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK]
```
Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim 3 is incorrectly
marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`.
```
9 pre_broadcast_shape = [
280 sympy.S.One if is_broadcasting else dim
281 for dim, is_broadcasting in zip(
282 -> self.broadcast_shape, self.broadcasting_dims
283 )
284 ]
```
The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]`
Triton throws the following error: `reshape() cannot change total number of elements in tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310
Approved by: https://github.com/blaine-rister
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr):
xnumel = 512
r0_numel = 32
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 16384
r0_numel = 128
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
Summary:
We use this constructor in HigherOrderKernel. Problems arise in the loop condition, where it's possible for an output of the previous iteration to be an input to the next: the Output(N) of a kernel may be the Input(M) of a kernel in the next iteration. Thus, if that output value is reset (via `fastresizetozero`) or overwritten by a previous kernel before it is used, we have major issues.
We need to enforce that outputs are moved, not copied, to ensure this doesn't happen.
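A tiny Python analogue of the hazard (illustrative only; the actual issue is in the C++ runtime's handling of output slots, not eager Python code):
```python
import torch

# A runtime-managed output slot that gets reused across loop iterations.
slots = {"out0": torch.arange(3.0)}

# Copy-style handoff: the loop carry and the slot reference the same tensor,
# so resetting the slot (the analogue of fastresizetozero) clobbers the carry.
carry = slots["out0"]
slots["out0"].resize_(0)
print(carry.numel())  # 0 -- the carried input was silently destroyed

# Move-style handoff: the slot relinquishes the tensor before being reused,
# so later resets of the slot cannot touch the carried value.
slots["out0"] = torch.arange(3.0)
carry = slots["out0"]
slots["out0"] = torch.empty(0)  # slot now holds a fresh tensor
print(carry)  # tensor([0., 1., 2.]) -- intact
```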
Test Plan:
buck2 test //caffe2/test:test_export --local-only -- test_while_loop_tensor_constant_idx_cpp_runtime_nonstrict
Rollback Plan:
Differential Revision: D80565374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161005
Approved by: https://github.com/SherlockNoMad
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.
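For illustration only (the function name below is hypothetical; the real experimental DCP entry points follow the same ordering), the call-site change looks like:
```python
import torch

model = torch.nn.Linear(4, 4)
path = "/tmp/ckpt/step_100"

# Before: checkpoint ops took (state_dict, path)
# save_checkpoint(model.state_dict(), path)

# After: checkpoint ops take (path, state_dict), mirroring open(path, ...)-style I/O
# save_checkpoint(path, model.state_dict())
```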
Test Plan:
sandcastle tests
Rollback Plan:
Differential Revision: D80549014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
This PR follows the previous one, which separated out the part that runs the instruction translator on a given frame and returns a DynamoTracerOutput.
The end result is a free function that runs the instruction translator independently. A follow-up diff will wrap the low-level function.
Differential Revision: [D80388694](https://our.internmc.facebook.com/intern/diff/D80388694/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160815
Approved by: https://github.com/anijain2305
ghstack dependencies: #160814
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. To this end, I have added three test cases, one to test input device movement and the other two to test parameter registration during the forward and backward pass of a model.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160147
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135, #160136
`tensor.view` shares the same `data_ptr()` as the original tensor, so it cannot serve as the key into rendezvous' map (we want a 1:1 match between handle and tensor, thus need a unique key).
@ezyang suggests using the raw `TensorImpl*` of a tensor, for which `tensor.view` would have a different value than the original tensor.
But the same raw `TensorImpl*` can be stumbled on again when a previous tensor gets deallocated and a new one allocated. For that reason, we also need a `weak_intrusive_ptr` to distinguish the two tensors, i.e. for the deallocated tensor, `weak_intrusive_ptr::expired()` would return true.
Added `test_rendezvous_view` and `test_rendezvous_same`.
Note: view support has been added to the NVSHMEM and NCCL backends; for the CUDA backend, I have yet to investigate.
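A conceptual Python analogue of the keying scheme (the real code uses the raw `TensorImpl*` plus a `c10::weak_intrusive_ptr` on the C++ side; `id()` and `weakref` stand in for them here):
```python
import weakref
import torch

_rendezvous = {}  # key -> (weak ref to tensor, handle)

def rendezvous(t: torch.Tensor) -> str:
    key = id(t)  # stand-in for the raw TensorImpl*
    entry = _rendezvous.get(key)
    if entry is not None:
        ref, handle = entry
        if ref() is t:
            return handle  # same, still-alive tensor: reuse its handle
        # key was recycled after a deallocation; the weak ref is stale
    handle = f"handle-{len(_rendezvous)}"
    _rendezvous[key] = (weakref.ref(t), handle)
    return handle

x = torch.randn(4)
v = x.view(2, 2)                       # same data_ptr(), distinct impl
assert rendezvous(x) != rendezvous(v)  # views get their own handle
assert rendezvous(x) == rendezvous(x)  # the same tensor maps to one handle
```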
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160925
Approved by: https://github.com/ngimel
ghstack dependencies: #160825
Summary:
Joint graph passes run several FX passes which can modify the graph before it hits Inductor.
There are three usages of joint graph passes:
- **for inference & not freezing** (we add structured loggings only for this)
- for inference & freezing
- for fw/bw split
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D80130321
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160589
Approved by: https://github.com/yushangdi
TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes.
This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only pass around obfuscated dtypes appearing as int32_ts, now users can confidently use torch::headeronly::ScalarType in their extensions for the major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.
Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t).
I then changed the test to test the scalar_type API.
This code change required some refactoring because of circular dependencies.
## BC Breaking note
This commit is (narrowly) BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
Summary:
Previously we would pass all serialized data to dataclass ctors.
Now we just loop over the existing fields of the dataclass and fetch only the fields we need to run the ctor.
This should help with the case where we are deserializing a buffer that has a new field.
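A minimal sketch of the idea, assuming plain dataclasses (not the actual serde code):
```python
import dataclasses

@dataclasses.dataclass
class Node:
    name: str
    target: str

def from_serialized(cls, data):
    # Keep only keys the current dataclass declares, so a payload written by
    # a newer schema (with extra fields) still constructs cleanly.
    known = {f.name for f in dataclasses.fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})

payload = {"name": "add", "target": "aten.add.Tensor", "new_field": 123}
node = from_serialized(Node, payload)  # cls(**payload) would raise TypeError
print(node)
```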
Test Plan:
CI
Rollback Plan:
Differential Revision: D80487716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160918
Approved by: https://github.com/angelayi
Summary:
When using the Inductor pattern matcher to replace graphs, the graph generated by the replacement function can be missing `original_aten` metadata for the replaced nodes. This further results in Inductor failing to generate a sensible kernel name, e.g. `tri_poi_fused_0`, missing the aten op name.
This diff attempts to fix that by allowing the graph in the replacement function to be traced with `preserve_node_meta`. This is included as an option to turn on in the `pattern_matcher.fwd_only` function.
Can confirm that with the fix, MTIA's pattern matcher replaces the original graph with a node that has `original_aten` meta, and the Inductor-generated kernel name includes the op name.
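A hedged sketch of the mechanism (the exact `fwd_only` option is in the PR; this only illustrates that tracing under `torch.fx.traceback.preserve_node_meta()` is what lets node metadata such as `original_aten` survive):
```python
import torch
import torch.fx.traceback as fx_traceback
from torch.fx.experimental.proxy_tensor import make_fx

def replacement(x):
    return torch.nn.functional.silu(x)

# Tracing the replacement graph inside preserve_node_meta() allows
# proxy-tensor tracing to attach metadata (e.g. node.meta["original_aten"])
# that Inductor can later use when building descriptive kernel names.
with fx_traceback.preserve_node_meta():
    gm = make_fx(replacement, tracing_mode="fake")(torch.randn(8))

for node in gm.graph.nodes:
    if node.op == "call_function":
        print(node.target, node.meta.get("original_aten"))
```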
Test Plan:
added kernel_name check to afg_inductor_test silu test
Rollback Plan:
Differential Revision: D80183670
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160542
Approved by: https://github.com/eellison, https://github.com/bdhirsh
When we search for a NVSHMEM allocation backing a tensor, don't limit it to an exact match between `tensor.data_ptr()` and `allocation.base_ptr`. Instead, test whether the former is within an allocation range, i.e. [base_ptr, base_ptr + size).
This PR also squashes in the original base PR #160795:
Since (i) `handle = rendezvous(tensor)`, and (ii) we pass `handle->buffer_ptrs` to kernels, `handle` should carry the `data_ptr()` of the tensor instead of the base address of a memory allocation (as was previously the case).
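A small illustration of the new lookup rule (plain Python, not the C++ code):
```python
def find_allocation(allocations, data_ptr):
    # allocations: list of (base_ptr, size) pairs registered with NVSHMEM
    for base_ptr, size in allocations:
        if base_ptr <= data_ptr < base_ptr + size:  # half-open range check
            return base_ptr, size
    return None

allocs = [(0x1000, 4096), (0x9000, 8192)]
print(find_allocation(allocs, 0x1000))        # exact base match: (0x1000, 4096)
print(find_allocation(allocs, 0x9000 + 256))  # pointer inside an allocation
print(find_allocation(allocs, 0x5000))        # not NVSHMEM-backed: None
```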
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160825
Approved by: https://github.com/Skylion007, https://github.com/ngimel
After https://github.com/pytorch/pytorch/pull/160635, I can see dependabot creating the PR to bump the `transformers` version at https://github.com/pytorch/pytorch/pull/160807. This is a good start, but there are several tweaks we need:
1. Run inductor tests on the PR, including one round of the perf benchmark, which is always needed. So, we need the `ciflow/inductor` label and a `pull_request` trigger for the benchmark.
2. Per @anijain2305's feedback, we don't need to update the patch version, so I add a rule to ignore it. Again, we will need to test this out after it lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160935
Approved by: https://github.com/anijain2305
Summary:
LLVM has a warning `-Wunused-local-typedef` which we are enabling to remove unused code. This has the side effect of making it easier to do refactors such as removing unnecessary includes.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan:
Sandcastle
Rollback Plan:
Differential Revision: D80511128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160944
Approved by: https://github.com/cyyever, https://github.com/Skylion007
This diff makes it so that the portion saving guards that can throw is completely separated from GuardBuilder, and instead in `serialize_guards`. This lets me add a try catch around it for caching precompile later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160662
Approved by: https://github.com/zhxchen17
Summary: Triton 3.4 is the most commonly used variant of Triton with Inductor, but if someone is working with an alternative version of Triton, the version check may not match. This replaces the check for Triton 3.4 with a check for any variant that supports the TMA APIs.
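A hedged sketch of a feature probe instead of a version compare (the attribute names are illustrative, not necessarily the exact check the PR adds):
```python
def has_tma_support() -> bool:
    try:
        import triton.language as tl
    except ImportError:
        return False
    # Probe for the tensor-descriptor (TMA) API surface rather than
    # comparing triton.__version__ against "3.4".
    return hasattr(tl, "make_tensor_descriptor") or hasattr(
        tl, "_experimental_make_tensor_descriptor"
    )
```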
Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`
Rollback Plan:
Differential Revision: D80348643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
Summary: The out variant has to be strided like self. Since the memory format isn't provided, this should be equivalent.
Test Plan:
Previously, when we enabled static dispatch, this test would have numeric issues:
```
buck2 test //caffe2/test:test_export -- test__scaled_dot_product_flash_attention_cpp_runtime_nonstrict --print-passing-details
```
Rollback Plan:
Reviewed By: SherlockNoMad
Differential Revision: D80191085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160560
Approved by: https://github.com/SherlockNoMad
# Context
Another fix to enable broad rollout of #149334.
The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality.
# This PR
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848
Approved by: https://github.com/kiukchung
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair to all backends. E.g. it includes torch.compile's Dynamo invocation overhead and hence doesn't reflect the real-world model use case, where Dynamo overhead is usually hidden.
I also opened a PR to use this timing measurement function on the TritonBench side: https://github.com/meta-pytorch/tritonbench/pull/333. But regardless of whether that PR can land, I think we should enhance Inductor's benchmark_gpu to match do_bench's features, to make it easier for people to migrate.
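For context, a small example of the kind of measurement `do_bench` performs on a compiled callable (illustrative; Inductor's `benchmark_gpu` is the in-tree alternative being enhanced here):
```python
import torch
import triton.testing

@torch.compile
def f(x):
    return torch.relu(x) + 1

x = torch.randn(4096, 4096, device="cuda")
f(x)  # warm up and compile once

# Every call below goes through torch.compile's dispatch/guard checks, so the
# reported time includes Dynamo invocation overhead that steady-state model
# runs typically hide.
ms = triton.testing.do_bench(lambda: f(x))
print(f"do_bench over the compiled callable: {ms:.4f} ms")
```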
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160921
Approved by: https://github.com/BoyuanFeng
[fx_graph_cse](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/compile_utils.py#L46) is executed in the min-cut partitioner and accidentally creates aliasing for empty buffers; we can see the following graph node in the joint graph with cmd: "pytest test/functorch/test_control_flow.py -k test_scan_multiple_layers_gradient_layers_2_device_cpu"
```python
while_loop = torch.ops.higher_order.while_loop(while_loop_cond_graph_0_0, while_loop_body_graph_0_0, (full_default_4, empty_strided_default, full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default, rev, rev_1, rev_2, rev_3), (primals_4, primals_5, primals_6, primals_7));
```
Notice the operands sequence **"full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default"**, which indicates that the gradients of different layers now share the same buffer, creating silent incorrectness.
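A standalone sketch of why CSE must not deduplicate buffer allocations (plain eager code, not the partitioner itself):
```python
import torch

def grads_buggy_cse():
    buf = torch.zeros(4)   # one allocation standing in for two "identical" nodes
    return buf, buf

def grads_correct():
    return torch.zeros(4), torch.zeros(4)  # distinct writable buffers

g1, g2 = grads_buggy_cse()
g1.add_(1.0)
print(torch.equal(g1, g2))  # True -- the second layer's gradient is corrupted

g1, g2 = grads_correct()
g1.add_(1.0)
print(torch.equal(g1, g2))  # False -- the buffers stay independent
```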
Fixes https://github.com/pytorch/pytorch/pull/158168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160668
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160374
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where the group size is 1 and these synchronizations and ops are needless. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed, and I edited an affected test that used 1-way FSDP by verifying and updating its assert statements for CommDebugMode. I have also added a test command; a sketch of the control flow follows.
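A hedged sketch of the control flow (illustrative names, not the actual FSDP internals):
```python
import torch
import torch.distributed as dist

def maybe_reduce_scatter(unsharded_grad, group):
    world_size = dist.get_world_size(group) if group is not None else 1
    if world_size == 1:
        # Group size 1 (replicate-as-HSDP): the unsharded gradient already
        # equals the reduced, sharded gradient, so skip the collective and
        # its associated synchronization entirely.
        return unsharded_grad
    out = torch.empty(
        unsharded_grad.numel() // world_size,
        dtype=unsharded_grad.dtype,
        device=unsharded_grad.device,
    )
    dist.reduce_scatter_tensor(out, unsharded_grad, group=group)
    return out
```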
**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160136
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135
Summary: Currently the implementation of [fbgemm_linear_fp16_weight](https://www.internalfb.com/code/fbsource/[ffe8ba561cb6af33fde5b32c27411d6d3f4f2c70]/fbcode/caffe2/aten/src/ATen/native/QuantizedLinear.cpp?lines=477) does not allow None for `bias`, but it's actually a valid case, and internally `fbgemm_linear_fp16_weight_fp32_activation` accepts a None bias as well. For BC reasons, we can't directly change the function signature, so we wrap an empty tensor when bias is None to work around it in Sigmoid.
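A minimal sketch of the workaround at the call site (an illustrative wrapper; the actual change lives in Sigmoid):
```python
import torch

def call_with_optional_bias(linear_op, x, packed_weight, bias=None):
    if bias is None:
        # The public op requires a Tensor bias, so substitute an empty one.
        bias = torch.empty(0, dtype=torch.float32, device=x.device)
    return linear_op(x, packed_weight, bias)

# e.g. call_with_optional_bias(torch.fbgemm_linear_fp16_weight, x, pw, None)
```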
Test Plan:
P1906210273
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=778442870
SNAPSHOT_ID=6
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=10000 &> ~/logs/${MODEL_TYPE}/load_net_predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}_${MODULE}
```
Rollback Plan:
Reviewed By: henryoier, hl475
Differential Revision: D80382652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160802
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
One incremental step we can take is to refactor out InstructionTranslator as a functional piece providing bytecode tracing.
To separate out this part, we note that the tracer object is currently passed around through the entire convert frame compile function. This is not ideal because we want to build a boundary between tracing and the downstream compiler stack. Ideally, we should extract all the relevant information out of the tracer object and return a new data structure that is free of the internal state of InstructionTranslator.
Luckily, not much data is used from the tracer after tracing is finished. The major piece is OutputGraph; other than that, we only need to record two boolean flags for error handling purposes.
The new type we're adding is called DynamoTracerOutput, which contains all the information needed by torch.compile internals after symbolic convert is finished. To simplify the current PR, we leave out the part that reduces OutputGraph to a minimal set, since this can be done in a separate PR.
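A hedged sketch of the shape of the new boundary type (field names are illustrative except where the description above states them):
```python
from dataclasses import dataclass

@dataclass
class DynamoTracerOutput:
    # The traced OutputGraph; a follow-up will slim this to a minimal set.
    output_graph: object
    # Two boolean flags recorded purely for error handling downstream
    # (illustrative names -- the description only says "two boolean flags").
    error_flag_a: bool
    error_flag_b: bool
```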
Differential Revision: [D80388693](https://our.internmc.facebook.com/intern/diff/D80388693/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160814
Approved by: https://github.com/tugsbayasgalan