pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-22 22:25:10 +08:00

Author	SHA1	Message	Date
pytorchbot	0fabc3ba44	CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165022 ) CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165013) Since we have introduced CUDA aarch64 builds for all cuda versions we need to remove this constraint. This was missed by https://github.com/pytorch/pytorch/pull/162364 Proper constraint on triton should be: ``` Requires-Dist: triton==3.5.0; platform_system == "Linux" ``` not: ``` Requires-Dist: triton==3.5.0; platform_system == "Linux" and platform_machine == "x86_64" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165013 Approved by: https://github.com/Camyll, https://github.com/nWEIdia, https://github.com/tinglvv (cherry picked from commit 81dbeb06f4b3eb6c56625ec25d377eb7c7c6c573) Co-authored-by: atalman <atalman@fb.com>	2025-10-08 21:09:57 -04:00
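The effect of dropping the `platform_machine` clause from the triton requirement above can be sketched in plain Python. This is a hand-rolled illustration rather than how pip actually evaluates environment markers; `needs_triton` and its parameters are hypothetical names introduced only for this example.

```python
def needs_triton(system: str, machine: str, restrict_to_x86_64: bool) -> bool:
    """Mimic the two markers quoted in the commit message (hypothetical helper)."""
    if system != "Linux":
        return False
    if restrict_to_x86_64:
        # Old marker: platform_system == "Linux" and platform_machine == "x86_64"
        return machine == "x86_64"
    # New marker: platform_system == "Linux"
    return True


for machine in ("x86_64", "aarch64"):
    old = needs_triton("Linux", machine, restrict_to_x86_64=True)
    new = needs_triton("Linux", machine, restrict_to_x86_64=False)
    print(f"{machine}: old marker -> {old}, new marker -> {new}")
```

Under the old marker an aarch64 Linux install would skip the triton dependency, which is the gap the commit describes.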
pytorchbot	26e023a973	[MPS] Update OS version in error message (#164949 ) [MPS] Update OS version in error message (#164946) Followup after https://github.com/pytorch/pytorch/pull/159912 Fixes https://github.com/pytorch/pytorch/issues/164943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164946 Approved by: https://github.com/Camyll (cherry picked from commit 01f3a43462da594b65a6c9e8b46c132cd360cea9) Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-10-08 11:11:48 -07:00
pytorchbot	6f12be2770	CUDA 13.0 builds fix on Amazon Linux 2023 (#164893 ) CUDA 13.0 builds fix on Amazon Linux 2023 (#164870) During 2.9 rc testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds This is related to: https://github.com/pytorch/pytorch/issues/152756 Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262 Error: ``` WARNING: There was an error checking the latest version of pip. + python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly Traceback (most recent call last): File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL) File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__ self._handle = _dlopen(self._name, mode) ^^^^^^^^^^^^^^^^^^^^^^^^^ OSError: libcudart.so.13: cannot open shared object file: No such file or directory During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module> import torch File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module> _load_global_deps() File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps _preload_cuda_deps(lib_folder, lib_name) File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps raise ValueError(f"{lib_name} not found in the system path {sys.path}") Traceback (most recent call last): ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages'] File 
"/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module> main() File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main run_cmd_or_die(f"docker exec -t {container_name} /exec") File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}") RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870 Approved by: https://github.com/Camyll (cherry picked from commit 483f4e0db91166128ad8922d86dc7222338d4ecc) Co-authored-by: atalman <atalman@fb.com> Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>	2025-10-07 19:33:08 -07:00
pytorchbot	42f0c2c970	update the baseline data for the operator benchmark (#164789 ) update the baseline data for the operator benchmark (#162693) According to the results of the last four operator benchmark runs, five models achieved more than a 30% improvement compared to the baseline. Therefore, we update the operator benchmark baseline data, using the average of the four runs as the new baseline for the five models. This change also adds a pull request trigger for the operator benchmark workflow.
| Benchmarking Framework | Benchmarking Module Name | Case Name | tag | run_backward | baseline old | r1 | r2 | r3 | r4 | avg | speedup |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PyTorch | add | add_M1_N1_K1_cpu | short | FALSE | 3.9497 | 2.57 | 2.54 | 2.38 | 2.31 | 2.45 | 1.61 |
| PyTorch | functional.hardtanh | functional.hardtanh_dims(512 512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.118 | 50.02 | 49.80 | 46.78 | 48.94 | 48.88 | 1.37 |
| PyTorch | relu6 | relu6_dims(512 512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 68.739 | 51.17 | 51.19 | 48.07 | 50.42 | 50.21 | 1.37 |
| PyTorch | relu6 | relu6_dims(256 1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 69.1875 | 51.97 | 52.77 | 50.00 | 51.24 | 51.50 | 1.34 |
| PyTorch | functional.hardtanh | functional.hardtanh_dims(256 1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.436 | 50.98 | 51.69 | 49.06 | 49.87 | 50.40 | 1.34 |
@chuanqi129 @huydhn @desertfire @jainapurva Pull Request resolved: https://github.com/pytorch/pytorch/pull/162693 Approved by: https://github.com/huydhn (cherry picked from commit f7ea4975abb0aeb0224894f0b54b1f8fd1fa70e3) Co-authored-by: LifengWang <lifeng.a.wang@intel.com>	2025-10-07 07:10:51 -07:00
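The speedup column in the benchmark data above appears to be the old baseline divided by the four-run average. A minimal sketch that recomputes it from the reported numbers follows; the dictionary keys are abbreviated labels chosen for this example, not the real case names, and the formula is inferred from the figures rather than stated in the commit.

```python
# (old baseline, average of the four runs), copied from the commit message.
# Keys are shortened, hypothetical labels for the five benchmark cases.
cases = {
    "add_M1_N1_K1_cpu": (3.9497, 2.45),
    "hardtanh_512x512_quint8": (67.118, 48.88),
    "relu6_512x512_quint8": (68.739, 50.21),
    "relu6_256x1024_quint8": (69.1875, 51.50),
    "hardtanh_256x1024_quint8": (67.436, 50.40),
}

# Assumed definition: speedup = old baseline / new average, rounded to 2 places.
speedups = {name: round(old / avg, 2) for name, (old, avg) in cases.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Every recomputed value exceeds 1.30, consistent with the "more than a 30% improvement" threshold mentioned in the message.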
pytorchbot	b015422da1	fix cpp extension distributed warning spew (#164785 ) fix cpp extension distributed warning spew (#162764) With the new change we only log the warning if we're running non-distributed code or if we're on rank 0. Unit testing that certain messages get printed only on certain ranks feels kinda jank, so the test plan is below instead. Test plan:
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. The change also improves the debug message to explain what the actual behavior will be:
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ***************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`, then we get:
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ***************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764 Approved by: https://github.com/ezyang, https://github.com/zou3519 (cherry picked from commit f7e83219619a05934a344ca699c33ee69d5a3642) Co-authored-by: Mark Saroufim <marksaroufim@meta.com>	2025-10-06 16:58:36 -07:00
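The "log only when non-distributed or on rank 0" rule described in the commit above could be sketched as follows. This is an assumption-laden illustration: it reads the `RANK` environment variable that `torchrun` sets, the same variable the test script checks, whereas the actual change in `cpp_extension.py` may detect the rank differently. `should_log` is a hypothetical name.

```python
import os


def should_log(environ=None) -> bool:
    """Return True for non-distributed runs or for rank 0 (hypothetical helper)."""
    env = os.environ if environ is None else environ
    rank = env.get("RANK")
    # No RANK variable: assume a plain, non-distributed process.
    return rank is None or rank == "0"


for env in ({}, {"RANK": "0"}, {"RANK": "1"}):
    print(env, "->", should_log(env))
```

Passing the environment as a parameter keeps the rule testable without launching multiple processes.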
pytorchbot	d4c4307032	Fix docker build issue after 164575 (#164779 ) Fix docker build issue after 164575 (#164774) Looks like https://github.com/pytorch/pytorch/pull/164575 introduced an issue. The command is wrong: ``` conda install -c "whl/nightly" -y python=3.11 conda=25.7.0 ``` Should be just using default conda channel: ``` conda install -y python=3.11 conda=25.7.0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164774 Approved by: https://github.com/Camyll (cherry picked from commit c1f40d33c89b361a1edad17aa25cfff1ab4014fd) Co-authored-by: atalman <atalman@fb.com>	2025-10-06 16:56:06 -04:00
pytorchbot	3b57315b1b	[ROCm] Increase binary build timeout to 5 hours (300 minutes) (#164770 ) [ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776) Despite narrowing down the [FBGEMM_GENAI build to gfx942](https://github.com/pytorch/pytorch/pull/162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897). This PR increases timeout for ROCm builds for both [libtorch ](https://github.com/pytorch/pytorch/actions/runs/17969771026)and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently. This PR is a more ROCm-targeted version of https://github.com/pytorch/pytorch/pull/162880 (which is for release/2.9 branch). Pull Request resolved: https://github.com/pytorch/pytorch/pull/163776 Approved by: https://github.com/jeffdaily (cherry picked from commit 0ec946a0522748332f42675a4d690ff32d773d42) Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-06 16:08:40 -04:00
pytorchbot	c74f05797d	Pin conda version for Docker builds (#164579 ) Pin conda version for Docker builds (#164575) Mitigates https://github.com/pytorch/pytorch/issues/164574 Remove unused CUDA_CHANNEL var - this was used before when we had pytorch install via conda. Please note: CUDA 13.0 failures are expected since the CI tries to build against prod and CUDA 13.0 is not available in prod yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164575 Approved by: https://github.com/malfet, https://github.com/Camyll (cherry picked from commit e40fe634b1a7aa33e278b1404ee02dea12277080) Co-authored-by: atalman <atalman@fb.com>	2025-10-03 11:44:46 -04:00
Andrey Talman	fd364580a9	[Cherry-Pick] Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980 ) (#164508 ) * Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980) Work Around for: https://github.com/pytorch/pytorch/issues/133437 Test plan: 1. Build whl in CI 2. Download 3. Run ``nm -D libtorch_cpu.so \| grep "recursive_directory_iterator"`` Test with check_binary_symbols.py: Success: ``` num_cxx11_symbols: 2326 num_pre_cxx11_symbols: 0 lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so num_statically_linked_symbols (T): 0 ``` Fail when using "W" instead of "T" as type calling ``cxx11_statically_linked_symbols = grep_symbols( lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="W" )`` : ``` num_cxx11_symbols: 2326 num_pre_cxx11_symbols: 0 lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so num_statically_linked_symbols (T): 20 Traceback (most recent call last): File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 130, in <module> main() File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 126, in main check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path) File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 95, in check_lib_statically_linked_libstdc_cxx_abi_symbols raise RuntimeError( RuntimeError: Found statically linked libstdc++ symbols (recursive_directory_iterator), but there shouldn't be any, see: ['std::filesystem::__cxx11::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::__cxx11::recursive_directory_iterator::depth() const', 'std::filesystem::__cxx11::recursive_directory_iterator::options() const', 'std::filesystem::__cxx11::recursive_directory_iterator::operator() const', 
'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::operator bool() const', 'std::filesystem::__cxx11::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::__cxx11::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::pop()', 'std::filesystem::__cxx11::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code)', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator&&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator const&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator++()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 
'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()'] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163980 Approved by: https://github.com/isuruf, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> fix --------- Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-10-02 17:49:44 -04:00
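The symbol check in the test plan above boils down to filtering `nm -D` output by symbol type and name. A rough sketch of that idea follows; it is not the code in `check_binary_symbols.py`, `exported_symbols` is a hypothetical helper, and the sample lines (addresses and the `example_function` entry) are made up for illustration.

```python
def exported_symbols(nm_output: str, needle: str, symbol_type: str = "T") -> list[str]:
    """Return symbol names of the given nm type that contain `needle`."""
    matches = []
    for line in nm_output.splitlines():
        # Defined symbols look like "<address> <type> <name>"; undefined ones
        # have no address column and are skipped by the length check.
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1] == symbol_type and needle in parts[2]:
            matches.append(parts[2])
    return matches


# Made-up sample in the shape of demangled `nm -D` output.
sample = "\n".join([
    "0000000000a1b2c0 T std::filesystem::__cxx11::recursive_directory_iterator::pop()",
    "0000000000a1b300 W std::filesystem::__cxx11::recursive_directory_iterator::depth() const",
    "                 U malloc",
    "0000000000a1b400 T example_function()",
])

strong = exported_symbols(sample, "recursive_directory_iterator", "T")
print(len(strong), strong)
```

A check in the spirit of the commit would fail the build whenever the strong ("T") list is non-empty.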
Lucas Kabela	2f6387e9a1	[CherrryPick][2.9] Cherry pick request for `Reapply "Make functionalization ViewMeta serializable with pickle #163769` (#163873 ) Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712)" (#163769) NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs are exclusively to the buck/build files (Summary from #161994 ) Attempted rebase of https://github.com/pytorch/pytorch/pull/143712. This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela imported-using-ghimport Test Plan: Imported from OSS Differential Revision: D81524507 Pulled By: Lucaskabela Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769 Approved by: https://github.com/dolpm (cherry picked from commit 7d710403b003e44bf31d367673a05468e49df75d) Co-authored-by: Brian Hirsh <hirsheybar@fb.com>	2025-10-02 16:07:51 -04:00
pytorchbot	017d857f5f	fix pickling for BitwiseFn (#163861 ) * fix pickling for BitwiseFn (#163571) Summary: ran into AttributeError: Can't get local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn' looks like it was fixed for UnaryFn but not BitwiseFn in https://github.com/pytorch/pytorch/pull/138395 Fixes #147841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163571 Approved by: https://github.com/jamesjwu (cherry picked from commit cde5c9aebd7a2eda0c935de1ab7a40b6453c5813) * Fix lintrunner with -a --------- Co-authored-by: dolpm <34420038+dolpm@users.noreply.github.com> Co-authored-by: Lucas Kabela <lucaskabela@meta.com>	2025-10-02 15:35:40 -04:00
pytorchbot	d6e8411889	Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164477 ) Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164470) I believe ``set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0`` is the one that's actually used. Hence remove 6.1 to align with Linux support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164470 Approved by: https://github.com/tinglvv, https://github.com/nWEIdia, https://github.com/Camyll (cherry picked from commit 235b995ce18de632ab816940319fcd66b46039b8) Co-authored-by: Andrey Talman <atalman@fb.com>	2025-10-02 14:33:06 -04:00
pytorchbot	10b501fde9	[Flex] Fix silent correctness w/ backpropping grads (#164366 ) [Flex] Fix silent correctness w/ backpropping grads (#163677) Fixes https://github.com/pytorch/pytorch/issues/162228 # Summary The majority of our tests compile flex-attention only in isolation. This means that for fake tensor propagation the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, it is not guaranteed to have accurate `requires_grad`-ness, which is what was happening here. TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint graph handle creating the correct joint graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677 Approved by: https://github.com/ydwu4 (cherry picked from commit e2ce79e4cce5327b71fcf366fad1133030563285) Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-10-01 14:43:28 -07:00
pytorchbot	31c72b8a96	[a2av] Separate in/out splits into two tensors (#164028 ) [a2av] Separate in/out splits into two tensors (#163837) Old signature: `all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)` New signature: `all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)` i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better. Also to be in line with the 2D version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837 Approved by: https://github.com/fduwjj ghstack dependencies: #163886 (cherry picked from commit bbf8aa43efe755b9c310347b3780962fca85bf9c) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-10-01 14:43:19 -07:00
pytorchbot	1cd83de315	[Flex attention] Fix flex attention head broadcast (#164368 ) [Flex attention] Fix flex attention head broadcast (#163426) Fixes part of #163314 In particular bug: Bug 1: H=None Broadcasting Produces Incorrect Results This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (mask[:, :, i]). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Due to them losing shape, even though the mask_mod remains "interpretable", the kernel’s stride math then reads wrong offsets. Due to this we get silent numerical mismatches compared to regular SDPA, especially when single position decoding/H broadcasting. The B=None, H=None works case is accidental: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1` and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn’t move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides which causes silent error Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426 Approved by: https://github.com/drisspg (cherry picked from commit 1a42656d6c43a9bb7eb90c511884ce451d29422f) Co-authored-by: Isalia20 <irakli.salia854@gmail.com>	2025-10-01 13:48:10 -07:00
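Why the `B=None, H=None` case in the commit above only works by accident can be illustrated with a toy offset calculation. The names `sparse_idx_hq`, `off_hq`, and `SPARSE_Q_MULTIPLE` come from the commit message; the offset formula, the stride values, and the `sparse_offset` helper are assumptions made for this sketch and are not the real kernel code.

```python
def sparse_offset(off_hq, mask_heads, q_start, sparse_q_multiple, stride_h, stride_q):
    """Toy pointer offset into kv_num_blocks (hypothetical formula)."""
    sparse_idx_hq = off_hq % mask_heads       # off_hq % 1 is always 0
    q_tile = q_start // sparse_q_multiple     # 0 when there is a single Q tile
    return sparse_idx_hq * stride_h + q_tile * stride_q


# Assumed strides: "right" for the expected [B, H, Q_tiles, ...] layout,
# "wrong" for a layout whose Q-tile axis was collapsed by integer indexing.
right = dict(stride_h=8, stride_q=4)
wrong = dict(stride_h=4, stride_q=1)

# One head in the mask and one Q tile: both index terms are 0, so the
# wrong strides are multiplied by 0 and the offsets coincide.
print(sparse_offset(3, 1, 0, 1, **right), sparse_offset(3, 1, 0, 1, **wrong))

# A second Q tile, or more than one head, makes the terms nonzero and
# the two layouts disagree.
print(sparse_offset(3, 1, 1, 1, **right), sparse_offset(3, 1, 1, 1, **wrong))
print(sparse_offset(3, 2, 0, 1, **right), sparse_offset(3, 2, 0, 1, **wrong))
```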
pytorchbot	881c2ccae9	Update Gloo submodule (#164371 ) Update Gloo submodule (#163112) Which makes PyTorch buildable with gcc-15, tested by running the build inside `fedora:44` docker ``` docker run --rm -it fedora:44 bash -c "yum install -y g++ python3-devel git; git clone https://github.com/pytorch/pytorch; cd pytorch; git checkout 8f710acce8332979c9a7bf97e72666dfd35c43e6; python3 -mpip install -r requirements.txt; python3 setup.py bdist_wheel" ``` Fixes https://github.com/pytorch/pytorch/issues/156595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163112 Approved by: https://github.com/huydhn (cherry picked from commit 65845d72917fc27cd89a88b067e7c8f44bc0c987) Co-authored-by: Nikita Shulga <nshulga@meta.com>	2025-10-01 12:00:18 -07:00
pytorchbot	764f65584a	[MPS] Chunk fillBuffer into 4Gb slices (#164370 ) [MPS] Chunk fillBuffer into 4Gb slices (#164108) To avoid regression on MacOS 26, which one could observe by running the following script
```swift
import Metal
let bufferSize = 1<<32 + 4
guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device found") }
guard let buffer = device.makeBuffer(length: bufferSize, options: .storageModeShared) else { fatalError("Failed to create buffer") }
guard let cmdQueue = device.makeCommandQueue() else { fatalError("Failed to create command queue") }
guard let cmdBuffer = cmdQueue.makeCommandBuffer() else { fatalError("Failed to create command buffer") }
guard let blitEncoder = cmdBuffer.makeBlitCommandEncoder() else { fatalError("Failed to create blit encoder") }
blitEncoder.fill(buffer: buffer, range: 0..<bufferSize, value: 0x42)
blitEncoder.endEncoding()
cmdBuffer.commit()
cmdBuffer.waitUntilCompleted()
let tailOffs = 8
let hostPtr = buffer.contents().bindMemory(to: UInt8.self, capacity: bufferSize)
let tail = Array(UnsafeBufferPointer(start: hostPtr + (bufferSize - tailOffs), count: tailOffs))
for (idx, val) in tail.enumerated() {
    print("Offs 0x\(String(bufferSize - tailOffs + idx, radix: 16)): 0x\(String(val, radix: 16))")
}
```
Test plan: run `test_indexing.py` on MacOS-26 Fixes https://github.com/pytorch/pytorch/issues/161265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164108 Approved by: https://github.com/Skylion007 (cherry picked from commit 6db1b9dd217501e0b3171d96335bed7b2bb53c36) Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-10-01 11:59:56 -07:00
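The chunking idea named in the commit title above can be sketched independently of Metal: split one large fill into consecutive ranges that never exceed a fixed size. This is a language-neutral illustration in Python, not the MPS backend change itself; `chunk_ranges` is a hypothetical helper and the 4 GiB chunk size is an assumption taken from the title.

```python
def chunk_ranges(total: int, chunk: int = 1 << 32):
    """Yield (offset, length) pairs covering [0, total) in pieces of at most `chunk` bytes."""
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        yield offset, length
        offset += length


# Same buffer size as the reproducer script: 4 GiB plus 4 bytes.
buffer_size = (1 << 32) + 4
for offset, length in chunk_ranges(buffer_size):
    print(f"fill offset=0x{offset:x} length=0x{length:x}")
```

With the reproducer's buffer size this produces one full 4 GiB slice followed by a 4-byte tail, so no single fill covers more than 4 GiB.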
pytorchbot	3e8a062385	Update Microsoft C++ Redistributable to the latest version (#164369 ) Update Microsoft C++ Redistributable to the latest version (#161430) Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430 Approved by: https://github.com/malfet (cherry picked from commit 1330c638bef7fac64a42935b5a46ee32637ddd4d) Co-authored-by: Saman Khatir <saman.khatir@amd.com>	2025-10-01 11:57:53 -07:00
pytorchbot	3abee625e1	Fix warn message (#164367 ) Fix warn message (#163578) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163578 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0 (cherry picked from commit f3f67ff43a014b75b804d5ded0c7de3d8e0be65f) Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-10-01 11:57:16 -07:00
pytorchbot	f227c883f9	[MPSHooks] Release pending command encoder (#164365 ) [MPSHooks] Release pending command encoder (#164093) Before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which results in the following runtime error ``` tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer' ``` Added regression test to `test_mps_extension` Please note that `torch::mps::get_command_buffer()` should be called with dispatch_queue held, both before and after this change, but many implementations skip that Fixes https://github.com/pytorch/pytorch/issues/163721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093 Approved by: https://github.com/atalman, https://github.com/Skylion007 (cherry picked from commit 8f32adc90a7fee83583c9ba89dbdfabb317e0452) Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-10-01 11:56:42 -07:00
pytorchbot	a5feacb14b	[SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#164364 ) [SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598) Fixes #163597 - Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride. - Updated tests with additional transpose/permutation layouts. New tests catch the regression. ### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport) Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro ### Vector Fast Path (q_len=1, k_len=256) - Before: 0.160 ms - After: 0.157 ms ### Vector 2-pass (q_len=1, k_len=4096) - Before: 0.342 ms - After: 0.339 ms ### Vector Fast Path (q_len=8, k_len=256) - Before: 0.228 ms - After: 0.231 ms ### Vector 2-pass (q_len=8, k_len=4096) - Before: 0.432 ms - After: 0.436 ms Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598 Approved by: https://github.com/malfet (cherry picked from commit 1c12d7416bc4f1cf0bc8a229e64169fc361b688e) Co-authored-by: Vismai Khanderao <59114226+Vismai-Khanderao@users.noreply.github.com>	2025-10-01 11:37:14 -07:00
Svetlana Karslioglu	71282c8364	Update Sphinx theme (#164147 ) (#164254 ) Fix links in the top nav bar: `71e55749be` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164147 Approved by: https://github.com/albanD (cherry picked from commit e88cca069171ceb117dd1ceb73e8bf3e54aa83cf)	2025-10-01 09:59:45 -07:00
Huy Do	e70d9f5322	[vllm hash update] update the pinned vllm hash (#164190 ) (#164312 ) * [vllm hash update] update the pinned vllm hash (#164190) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164190 Approved by: https://github.com/pytorchbot * Cherry pick b7125b3c456d48445ab0b84fab28702577cd9557 Signed-off-by: Huy Do <huydhn@gmail.com> --------- Signed-off-by: Huy Do <huydhn@gmail.com> Co-authored-by: PyTorch UpdateBot <pytorchupdatebot@users.noreply.github.com>	2025-10-01 06:43:17 -07:00
pytorchbot	005e3e8d78	Clean up obsoleted vLLM tests (#164282 ) Clean up obsoleted vLLM tests (#163383) They have been removed in https://github.com/vllm-project/vllm/pull/25117 and https://github.com/vllm-project/vllm/pull/22772, thus failing in trunk at the moment after the latest pin commit update Pull Request resolved: https://github.com/pytorch/pytorch/pull/163383 Approved by: https://github.com/wdvr, https://github.com/seemethere, https://github.com/malfet (cherry picked from commit a31acf32bd18e115df910002aef42baf7a9b4a33) Co-authored-by: Huy Do <huydhn@gmail.com>	2025-09-30 14:40:57 -07:00
pytorchbot	72cf48ea43	[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#164236 ) [AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988) See also #163972, which was intended to be this PR. Triton (release/3.5.x) by default ships CUDA12.8 ptxas. This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark. Fixes https://github.com/pytorch/pytorch/issues/163801 Test Plan: Check the binary size increase against nightly or v2.9RC. Install the binary on a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression. Reference: https://github.com/pytorch/pytorch/pull/119750 and `5c814e2527` Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary. However, as is, the triton world does not know about the existence of this new cuda13 ptxas. So if a user thinks there is already a pytorch/bin/ptxas and deletes the ptxas from triton, then `c6ad34f7eb/python/triton/knobs.py (L216)` would still complain that ptxas is not found (once removed, it won't know this new one is available) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988 Approved by: https://github.com/atalman (cherry picked from commit 3b4ad4a17d69e2db495ecaf3bae8916282a4eb0d) Co-authored-by: Wei Wang <weiwan@nvidia.com>	2025-09-30 13:53:56 -04:00
pytorchbot	a21a4bf11a	[CI] Move libtorch-cpu-shared-with-deps-release-build to python 3.10 (#164182 ) [CI] Move libtorch-cpu-shared-with-deps-release-build to python 3.10 (#162877) Related to https://github.com/pytorch/pytorch/pull/162862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162877 Approved by: https://github.com/malfet (cherry picked from commit c9e57d7e9f326e427fc4ae5c318fd017cd4b75a9) Co-authored-by: atalman <atalman@fb.com>	2025-09-29 15:52:16 -07:00
pytorchbot	21fec65781	Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#164172 ) Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956) Workaround for https://github.com/pytorch/pytorch/issues/163658 Looks like the workflow passes on 12.8 builds that use linux.g4dn.4xlarge.nvidia.gpu but it's failing on 12.6 builds that use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163956 Approved by: https://github.com/malfet (cherry picked from commit 349c960970f4e29eff0d37a9b3c1ca5ed86a121a) Co-authored-by: atalman <atalman@fb.com> Co-authored-by: Mark Saroufim <marksaroufim@meta.com>	2025-09-29 16:14:37 -04:00
pytorchbot	22d46b50ec	[CUDA] revert PR 130472 (#163379 ) [CUDA] revert PR 130472 (#162950) This change may also resolve https://github.com/pytorch/pytorch/issues/161789, though verification is still needed. PR #130472 would introduced the problem of freeing the same address without clean metadata. according to the below discussion, reverted it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162950 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed (cherry picked from commit 4a160dae3cabaff358a6bb2490d0160dd1bf2cdf) Co-authored-by: thenumberouscode <dream20151224@163.com>	2025-09-29 16:05:26 -04:00
pytorchbot	d1b63e2b4a	Skip test_conv3d_cudnn_broken on ROCM (#164163 ) Skip test_conv3d_cudnn_broken on ROCM (#164138) Followup after https://github.com/pytorch/pytorch/pull/163903 Fixes https://github.com/pytorch/pytorch/issues/164137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164138 Approved by: https://github.com/Camyll (cherry picked from commit 95be302889b8683b7ec7793a69ffa8891b6b5af8) Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-29 11:41:18 -07:00
pytorchbot	20100b7210	[c10d] P2P tensors must be dense (#163981 ) [c10d] P2P tensors must be dense (#163719) Fixes #161324 by adding `is_non_overlapping_and_dense` check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163719 Approved by: https://github.com/ngimel (cherry picked from commit 11a231ef52841a549913b7a6d423cc9004b6b58b) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-29 11:27:24 -07:00
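To make the `is_non_overlapping_and_dense` condition in the commit above concrete, here is a minimal pure-Python sketch of the idea: a layout is dense and non-overlapping when its strides, sorted in ascending order, tile one memory block exactly once. This is an illustration only. The function below is a hypothetical stand-in working on plain size and stride tuples, not PyTorch's implementation or the c10d check itself.

```python
def is_non_overlapping_and_dense(sizes, strides):
    """Return True if (sizes, strides) covers one memory block exactly once.

    Hypothetical stand-in for illustration; operates on plain tuples.
    """
    if len(sizes) != len(strides):
        raise ValueError("sizes and strides must have the same length")
    # An empty tensor occupies no memory, so it is trivially dense.
    if any(size == 0 for size in sizes):
        return True
    # Size-1 dimensions never advance the pointer, so their strides are ignored.
    dims = sorted(
        (stride, size) for size, stride in zip(sizes, strides) if size != 1
    )
    expected_stride = 1
    for stride, size in dims:
        if stride != expected_stride:
            return False  # a gap (stride too large) or an overlap (too small)
        expected_stride *= size
    return True


contiguous = is_non_overlapping_and_dense((2, 3), (3, 1))   # row-major
transposed = is_non_overlapping_and_dense((3, 2), (1, 3))   # permuted, still dense
strided = is_non_overlapping_and_dense((2, 2), (4, 2))      # every other element
expanded = is_non_overlapping_and_dense((2, 3), (0, 1))     # broadcast, overlapping
```

Under this definition a transposed tensor still qualifies, while a strided slice or a broadcast view does not.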
pytorchbot	a2c77043ee	Add operator benchmarking run to CI nightly (#164151 ) Add operator benchmarking run to CI nightly (#162530) This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode. Benchmark Configuration and Coverage: * Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI. * The CI benchmarking is running on various hardwares: H100, A100. * The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530 Approved by: https://github.com/huydhn (cherry picked from commit 54b38f3b46c33a1cc4e8f7894619358afcbd7c89) Co-authored-by: jainapurva <apurvajain.kota@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2025-09-29 11:21:19 -07:00
pytorchbot	b64fc8e41e	Fix operator benchmark issue#162708 (#164140 ) Fix operator benchmark issue#162708 (#162744) This PR skips memory metric calculation for ops which don't take tensor input, fixing the operator_benchmark bug Fixes https://github.com/pytorch/pytorch/issues/162708 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162744 Approved by: https://github.com/huydhn (cherry picked from commit 5f66902ecfb9cb4f7b9c50cb86307217cec1dbe9) Co-authored-by: jainapurva <apurvajain.kota@gmail.com>	2025-09-29 09:34:26 -07:00
pytorchbot	709f4f62a0	[cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#164027 ) [cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#163581) To work around #163539. Still confirming whether 9.10 is affected. The original test states that the convolution is "large," but note that the input size does not appear to require 64-bit indexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163581 Approved by: https://github.com/ngimel, https://github.com/malfet (cherry picked from commit e2817ac20426356278502db3b1614ea87cb7cff7) Co-authored-by: Eddie Yan <eddiey@nvidia.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-29 09:07:14 -07:00
pytorchbot	11f776c8ee	[cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#164026 ) [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903) cuDNN introduced some broken heuristics for these cases so we need to disable dropout to avoid unexpected crashes due to heuristics refusing to proceed Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman (cherry picked from commit ed3085814a870f7a07b7f9c696999a47d4f85376) Co-authored-by: Eddie Yan <eddiey@nvidia.com>	2025-09-29 09:06:23 -07:00
pytorchbot	45e257f046	[cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#164023 ) [cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#163171) test is breaking, will check if there's an older version that we can enable on to avoid completely dropping support Pull Request resolved: https://github.com/pytorch/pytorch/pull/163171 Approved by: https://github.com/ngimel, https://github.com/malfet (cherry picked from commit 0ea10f9912a9ec7c6d606bc71e3ec91f20372212) Co-authored-by: eqy <eddiey@nvidia.com>	2025-09-29 09:03:36 -07:00
pytorchbot	37e2626639	Update the operator benchmarking, to benchmark using torch.compile (#164101 ) Update the operator benchmarking, to benchmark using torch.compile (#161394) This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass); improves the output format in JSON to be used by dashboard for reporting; and introduce some more CLI options. The new CLI flags introduced are: - Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit` - Added `--benchmark-name` argument for customizing the benchmark name in output - Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name Sample command to run a single operator: `python -m pt.mm_test --use-compile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394 Approved by: https://github.com/jbschlosser (cherry picked from commit af60398c3a057506363e028bf328843a755b4f24) Co-authored-by: jainapurva <apurvajain.kota@gmail.com>	2025-09-29 07:49:05 -07:00
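The CLI behaviour described in the commit above (a `--use-compile` flag that is mutually exclusive with `--use-jit`, plus a default for `--output-json-for-dashboard`) could be wired up roughly as follows. This is a sketch using only `argparse`. The flag names and the default file name come from the commit message, while `build_parser`, the `store_true` actions, and everything else are assumptions rather than the benchmark suite's actual code.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in the commit message."""
    parser = argparse.ArgumentParser(description="operator benchmark (sketch)")
    # argparse rejects the command line if both flags of a group are given.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--use-jit", action="store_true")
    mode.add_argument("--use-compile", action="store_true")
    parser.add_argument("--benchmark-name", default=None)
    parser.add_argument(
        "--output-json-for-dashboard", default="benchmark-results.json"
    )
    return parser


args = build_parser().parse_args(["--use-compile"])
```

Passing both `--use-jit` and `--use-compile` makes `argparse` exit with a usage error, which is the standard-library way to express this kind of exclusivity.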
pytorchbot	d7a703ea92	[SymmMem] Barrier on team instead of world (#163376 ) [SymmMem] Barrier on team instead of world (#163298) As titled. Avoids a potential hang when running dispatch and combine in subgroups. The rest is just a rearrangement of the tests to create a sub-group test class (no substantial change). Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298 Approved by: https://github.com/fegin (cherry picked from commit f8fb437197033c33ecc435cd5e1e6a5b2bc5bf69) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-26 16:41:18 -07:00
pytorchbot	daa3d04325	[SymmMem] Fix memory allocation hold-up (#163375 ) [SymmMem] Fix memory allocation hold-up (#162680) Problem: Without MemPool, it looks like the nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keep references to memory allocations. Solution: - Remove the reference to the allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap the allocation to ensure single ownership. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680 Approved by: https://github.com/ezyang ghstack dependencies: #163298 (cherry picked from commit 7130b174e07dbc1a708934b18dede3d88e8f779f) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-26 16:35:56 -07:00
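The cause described in the commit above, a long-lived map whose handles keep allocations alive, can be reproduced in miniature with Python reference counting. The sketch below is an analogy only: the real fix is in C++ (`unique_ptr` instead of `shared_ptr`), and the classes `Allocation`, `LeakyHandle`, and `LeanHandle` are hypothetical names invented for the illustration.

```python
import gc
import weakref


class Allocation:
    """Hypothetical stand-in for a symmetric-memory allocation."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


class LeakyHandle:
    """Keeps a strong reference, so the allocation lives as long as the handle."""

    def __init__(self, alloc):
        self.alloc = alloc


class LeanHandle:
    """Copies the metadata it needs and holds no reference to the allocation."""

    def __init__(self, alloc):
        self.nbytes = alloc.nbytes


def allocation_survives(handle_cls):
    registry = {}                      # plays the role of the long-lived map
    alloc = Allocation(1024)
    probe = weakref.ref(alloc)         # observes the object without owning it
    registry["buf"] = handle_cls(alloc)
    del alloc                          # the caller gives up its reference
    gc.collect()
    return probe() is not None         # True means the memory was never released


leaky = allocation_survives(LeakyHandle)
lean = allocation_survives(LeanHandle)
```

With the strong reference removed from the handle, dropping the caller's reference is enough to release the allocation, which is the single-ownership behaviour the commit aims for.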
pytorchbot	999304396f	[dist] handle discontiguous allgather/reducescatter inputs (#163987 ) [dist] handle discontiguous allgather/reducescatter inputs (#163712) Fixes #163483 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501 (cherry picked from commit 71eec6a0bf69f712f4b9279fdc8d1459be0426e6) Co-authored-by: Natalia Gimelshein <ngimel@meta.com>	2025-09-26 16:21:08 -07:00
pytorchbot	5340e741df	[Reland][163423] Promote `@requires_nvshmem` instead of `enable_triton` (#163916 ) [Reland][163423] Promote `@requires_nvshmem` instead of `enable_triton` (#163549) #163423 was approved but reverted due to a revert of base. Relanding without base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163549 Approved by: https://github.com/wdvr (cherry picked from commit 6e6c899347db952f6a691feb4e8610fe9cca0279) Co-authored-by: Ke Wen <kw2501@fb.com> Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>	2025-09-26 15:58:30 -07:00
pytorchbot	7cadf8ac04	[Inductor][Intel GPU] Save `threads_per_warp` from triton compiled kernel for launching kernel correctly in cpp wrapper. (#163388 ) [Inductor][Intel GPU] Save `threads_per_warp` from triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315) On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315 Approved by: https://github.com/EikanWang, https://github.com/desertfire (cherry picked from commit 9f8a311af09586ac4026d6a56fc7c4ac7acc62ed) Co-authored-by: xinan.lin <xinan.lin@intel.com>	2025-09-26 14:42:09 -04:00
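A short arithmetic sketch shows why `threads_per_warp` from the commit above has to be saved rather than assumed. It assumes the launch size is computed as `num_warps * threads_per_warp`. The helper `threads_per_block` is hypothetical and is not the Inductor cpp-wrapper code, which the commit message does not show.

```python
def threads_per_block(num_warps, threads_per_warp=32):
    """Hypothetical helper: total threads launched for one kernel instance."""
    if num_warps <= 0 or threads_per_warp <= 0:
        raise ValueError("num_warps and threads_per_warp must be positive")
    return num_warps * threads_per_warp


assumed = threads_per_block(num_warps=4)                       # hard-coded 32
actual = threads_per_block(num_warps=4, threads_per_warp=16)   # Intel GEMM case
```

With four warps, assuming 32 gives 128 threads while a kernel compiled for 16 threads per warp expects 64, so a launcher that hard-codes 32 would use the wrong configuration.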
pytorchbot	f9e495fe8e	Move inductor jobs 3.9->3.10 (#163954 ) Move inductor jobs 3.9->3.10 (#162323) Related to: https://github.com/pytorch/pytorch/issues/161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323 Approved by: https://github.com/huydhn, https://github.com/Skylion007 (cherry picked from commit e8eeb060348f250975124abb957b1d7d9c4af9a0) Co-authored-by: atalman <atalman@fb.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2025-09-26 12:37:50 -04:00
pytorchbot	57dc68844d	[CI] Fix test_triton_wait_until hang (#163914 ) [CI] Fix test_triton_wait_until hang (#163886) I don't know why `nvshmem_barrier_all_kernel` leads the test to hang. Will investigate. But since it is an unnecessary call here, I am removing it to unblock other PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886 Approved by: https://github.com/fegin (cherry picked from commit 96275dbf88372bb32a123c4ea918498128fbecb9) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-26 12:16:00 -04:00
Cui, Yifeng	63da9d2730	[Release 2.9] Update torch-xpu-ops commit pin (#163622 ) Update commit pin to 789f59	2025-09-26 09:46:02 -04:00
pytorchbot	824d59fbf6	[CI] Install libuv for Win testing (#163907 ) [CI] Install libuv for Win testing (#163797) The current working theory for why `f0078941cf` caused a regression is that Windows CI could no longer be built with distributed, as it could not find libuv Pull Request resolved: https://github.com/pytorch/pytorch/pull/163797 Approved by: https://github.com/wdvr (cherry picked from commit cc660d38ac533b92f3ad4cb1105f7a16f74b9f09) Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-26 00:03:22 -07:00
pytorchbot	fc8bf12b38	Fix cpp build (#163887 ) Fix cpp build (#162774) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/162774 Approved by: https://github.com/malfet, https://github.com/atalman (cherry picked from commit b61bdc7cc4c841bf7574bc993f3fd445682f0997) Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-09-25 14:50:59 -07:00
pytorchbot	49dab18ecf	[CD] Add statically linked windows libraries to exclude list (#163862 ) [CD] Add statically linked windows libraries to exclude list (#163768) Fixes: https://github.com/pytorch/pytorch/issues/159514 Seeing following in the Wheel build logs: ``` Linking CXX static library lib\kineto.lib Linking CXX static library lib\dnnl.lib .... ``` These files are around 800MB uncompressed and 109MB compressed, hence provide ~50% size reduction for Windows CPU builds. Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768 Approved by: https://github.com/albanD, https://github.com/malfet (cherry picked from commit 98c4e35f14601909c113b4fd2857b6f0fb525316) Co-authored-by: atalman <atalman@fb.com>	2025-09-25 14:46:56 -07:00
Camyll Harajli	0154ca1d3d	[BE] Update Python min version to 3.10 (#162310 ) (#163885 ) * [BE] Update Python min version to 3.10 (#162310) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310 Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi * comment out executorch --------- Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-09-25 14:44:48 -07:00
Andrey Talman	132d9fac3b	Revert "[BE] Update Python min version to 3.10 (#162310 )" (#163882 ) Revert "[BE] Update Python min version to 3.10 (#162310) (#163802)" This reverts commit 7d024a6e299eee2830e9fbdae1913e432160bb23.	2025-09-25 10:54:12 -07:00
Camyll Harajli	87c5d4a858	[cherrypick] [CI] Move Windows build/tests to Python-3.10 #162862 (#163800 ) [CI] Move Windows build/tests to Python-3.10 (#162862) What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later ignores them, namely: - `PYTHON_VERSION` used to be a no-op that was simply ignored by the scripts - With this change, the `setup-win` action will create an environment called `py_tmp` with a specific python version + intel-openmp (that is a hard runtime requirement, but for some reason not packaged into the wheel nor marked as such) - Copied test type dependencies from `be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16)` into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps I think in the long run, one needs to update `4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1` that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862 Approved by: https://github.com/wdvr, https://github.com/huydhn ghstack dependencies: #163339, #163341 Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-09-25 09:06:52 -07:00
Andrey Talman	b0dc90881c	[CD] Simplify NVIDIA driver installation step (#163349 ) (#163790 ) Undo changes introduced in https://github.com/pytorch/pytorch/pull/160956 as driver has been updated to 580 for both fleets Fixes https://github.com/pytorch/pytorch/issues/163342 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163349 Approved by: https://github.com/seemethere Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-25 10:40:57 -04:00
Andrey Talman	c0577aad39	Use cuda nvrtc so file based on cuda version used by torch (#163642 ) (#163788 ) Fixes https://github.com/pytorch/pytorch/issues/162367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163642 Approved by: https://github.com/msaroufim	2025-09-25 10:40:09 -04:00
pytorchbot	9952b87600	[CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163766 ) [CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163661) Preload logic no longer works with CUDA 13.0 See the installation path: ``` ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/ libcheckpoint.so libcudadevrt.a libcufft.so.12 libcufile_rdma.so.1 libcusolver.so.12 libnvJitLink.so.13 libnvperf_target.so libnvrtc.alt.so.13 libpcsamplingutil.so libcublas.so.13 libcudart.so.13 libcufftw.so.12 libcupti.so.13 libcusolverMg.so.12 libnvblas.so.13 libnvrtc-builtins.alt.so.13.0 libnvrtc.so.13 libcublasLt.so.13 libcudart_static.a libcufile.so.0 libcurand.so.10 libcusparse.so.12 libnvperf_host.so libnvrtc-builtins.so.13.0 libnvtx3interop.so.1 ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/ cu13 cudnn cusparselt nccl nvshmem ``` Test using script from : https://github.com/pytorch/pytorch/issues/162367 ``` Kernel test passed! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163661 Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll (cherry picked from commit 141fc7276ebc722b6076cc3afe4fbc6307a1b775) Co-authored-by: atalman <atalman@fb.com>	2025-09-25 10:38:16 -04:00
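The directory listing in the commit above shows CUDA 13 libraries living under a single `nvidia/cu13/lib/` folder. A search that globs over `nvidia/*/lib/` would find libraries in that layout as well as in per-package folders. The sketch below illustrates that idea with the standard library only. `find_cuda_lib` is a hypothetical helper, not the preload function in `torch/__init__.py`, and the demo builds a throwaway directory tree instead of touching a real installation.

```python
import glob
import os
import tempfile


def find_cuda_lib(search_roots, lib_pattern):
    """Return the first file matching lib_pattern under <root>/nvidia/*/lib.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    for root in search_roots:
        pattern = os.path.join(root, "nvidia", "*", "lib", lib_pattern)
        candidates = sorted(glob.glob(pattern))
        if candidates:
            return candidates[0]
    return None


# Build a fake site-packages tree that mimics the CUDA 13 layout from the log.
demo_root = tempfile.mkdtemp()
cu13_lib_dir = os.path.join(demo_root, "nvidia", "cu13", "lib")
os.makedirs(cu13_lib_dir)
open(os.path.join(cu13_lib_dir, "libcudart.so.13"), "w").close()

found = find_cuda_lib([demo_root], "libcudart.so.*[0-9]")
missing = find_cuda_lib([demo_root], "libnvrtc.so.*[0-9]")
```

Here `found` points at the fake `libcudart.so.13` and `missing` is `None`, because the demo tree contains no `libnvrtc` file.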
Andrey Talman	300bade202	[Cherry-Pick] [CD] CUDA 13 specific followup changes. Remove sm50-70 From CUDA 12.6 and CUDA 12.8 builds (#162455 ) (#163764 ) * [CD] CUDA 13 specific followup changes (#162455) Follow up for CUDA 13 bring up https://github.com/pytorch/pytorch/issues/159779 sm50-70 should not be added to sbsa build arch list, as previous archs had no support for arm. remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455 Approved by: https://github.com/atalman * update --------- Co-authored-by: Ting Lu <tingl@nvidia.com>	2025-09-25 10:37:52 -04:00
pytorchbot	96f0c0fa07	Fix some edge cases (#163106 ) Fix some edge cases (#162295)
```
Summary

🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
| attn_type      | dtype          | shape(B,Hq,M,Hkv,N,D)       | TFlops BWD (base) | TFlops BWD (no_peel) | no_peel_speedup_over_base | pct_delta |
| str            | str            | str                         | f64               | f64                  | f64                       | f64       |
|----------------|----------------|-----------------------------|-------------------|----------------------|---------------------------|-----------|
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64)  | 56.937931         | 58.960459            | 1.035522                  | 3.552163  |
| noop           | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 89.221306         | 86.295642            | 0.967209                  | -3.27911  |
| causal         | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 111.552594        | 114.380841           | 1.025353                  | 2.535349  |
| alibi          | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 74.830149         | 76.685445            | 1.024793                  | 2.479344  |
| alibi          | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64)  | 55.279932         | 56.369312            | 1.019707                  | 1.97066   |

🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline):
shape: (5, 7)
| attn_type      | dtype          | shape(B,Hq,M,Hkv,N,D)       | TFlops BWD (base) | TFlops BWD (no_peel) | no_peel_speedup_over_base | pct_delta |
| str            | str            | str                         | f64               | f64                  | f64                       | f64       |
|----------------|----------------|-----------------------------|-------------------|----------------------|---------------------------|-----------|
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64)  | 56.937931         | 58.960459            | 1.035522                  | 3.552163  |
| causal         | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 111.552594        | 114.380841           | 1.025353                  | 2.535349  |
| alibi          | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 74.830149         | 76.685445            | 1.024793                  | 2.479344  |
| alibi          | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64)  | 55.279932         | 56.369312            | 1.019707                  | 1.97066   |
| causal         | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64)  | 111.08814         | 112.447047           | 1.012233                  | 1.22327   |

🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline):
shape: (5, 7)
| attn_type      | dtype          | shape(B,Hq,M,Hkv,N,D)       | TFlops BWD (base) | TFlops BWD (no_peel) | no_peel_speedup_over_base | pct_delta |
| str            | str            | str                         | f64               | f64                  | f64                       | f64       |
|----------------|----------------|-----------------------------|-------------------|----------------------|---------------------------|-----------|
| noop           | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 89.221306         | 86.295642            | 0.967209                  | -3.27911  |
| causal         | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64)  | 78.23082          | 76.693169            | 0.980345                  | -1.965531 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 96.95663          | 95.573333            | 0.985733                  | -1.426717 |
| alibi          | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64)  | 93.373473         | 92.294147            | 0.988441                  | -1.155924 |
| alibi          | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 96.95147          | 96.105389            | 0.991273                  | -0.872685 |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295 Approved by: https://github.com/mlazos, https://github.com/v0i0 (cherry picked from commit 864ffe12d737403230e8257b9bce0a830bd590c1) Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-09-25 10:29:39 -04:00
Camyll Harajli	7d024a6e29	[BE] Update Python min version to 3.10 (#162310 ) (#163802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310 Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi ghstack dependencies: #162862 Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-09-24 15:48:19 -07:00
pytorchbot	be29c5b207	Add analytics ID to cpp docs (#163695 ) Add analytics ID to cpp docs (#163370) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163370 Approved by: https://github.com/albanD (cherry picked from commit e6a9db58d71e474deac28276de1f611638c32eeb) Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-09-24 15:45:17 -07:00
pytorchbot	5322dab793	Update pytorch.org links in docs/conf.py (#163703 ) Update pytorch.org links in docs/conf.py (#163682) Update links in conf.py to docs.pytorch.org Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/163682 Approved by: https://github.com/sekyondaMeta, https://github.com/albanD (cherry picked from commit 8c8416b021e59a5ec58aceb38eeffc63885a28bc) Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-09-24 15:44:43 -07:00
pytorchbot	1dadb6196b	[BE] Introduce `CONDA_ROOT_DIR` (#163805 ) [BE] Introduce `CONDA_ROOT_DIR` (#163341) Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`, and replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341 Approved by: https://github.com/clee2000 ghstack dependencies: #163339 (cherry picked from commit a273475b01e912f402378a522bb9c4ed37e8413a) Co-authored-by: Nikita Shulga <nshulga@meta.com>	2025-09-24 15:42:16 -07:00
pytorchbot	6c058c1262	Move ROCM trunk wheel builds to 3.10 (#163804 ) Move ROCM trunk wheel builds to 3.10 (#163339) This code is a delicious spaghetti: Sometimes python version is defined in jinja template (see https://github.com/pytorch/pytorch/pull/162297 ) sometimes in shell script (see https://github.com/pytorch/pytorch/pull/162877 ), but this time around it's in a python file (and there is another one called `generate_binary_build_matrix.py` that defines `FULL_PYTHON_VERSIONS`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163339 Approved by: https://github.com/clee2000 (cherry picked from commit 52dd7a898c117305b4407c7f26bbcc7b39f20aaa) Co-authored-by: Nikita Shulga <nshulga@meta.com>	2025-09-24 15:41:55 -07:00
pytorchbot	715dca6725	[export] Remove .contiguous() when saving weights to raw bytes (#163662 ) [export] Remove .contiguous() when saving weights to raw bytes (#163587) Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading. Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing Differential Revision: D83016250 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587 Approved by: https://github.com/angelayi (cherry picked from commit 720a7b2887ca4efc8d63b32373182bc97918c76e) Co-authored-by: Yiming Zhou <yimingzhou@meta.com>	2025-09-23 10:15:06 -07:00
pytorchbot	47cb45e4f6	Update pytorch_sphinx_theme2 to latest hash (#163655 ) Update pytorch_sphinx_theme2 to latest hash (#163269) The updated theme: - Fixes articleBody in the json+ld that caused previous Google Search issues - Other minor fixes - 404.html fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269 Approved by: https://github.com/albanD (cherry picked from commit 68e75be86ab618bb6b1dc32b603a780ff6046262) Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-09-23 10:13:51 -07:00
pytorchbot	4966d058f2	CUDA 13.0 Warning update for supported architectures (#163633 ) CUDA 13.0 Warning update for supported architectures (#163585) Please see build script: `8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)` This should display correct warning: `` Please install PyTorch with a following CUDA configurations: 12.6 12.8 13.0 following instructions at https://pytorch.org/get-started/locally/ `` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585 Approved by: https://github.com/malfet (cherry picked from commit 3c64b2abab5a23809140da5bd6272307b776e459) Co-authored-by: atalman <atalman@fb.com>	2025-09-23 10:13:06 -07:00
pytorchbot	579794ed7b	[SymmMem] Fix put_signal + wait_until hang (#163458 ) [SymmMem] Fix put_signal + wait_until hang (#163194) The test used a wrong ptr to refer to remote address: ``` dst_ptr = out_hdl.buffer_ptrs[peer] src_ptr = inp_hdl.buffer_ptrs[rank] sig_ptr = out_hdl.signal_pad_ptrs[peer] ``` All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local address as input and perform translation internally. Without correct signal address, the peer would be waiting, thus hang. Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194 Approved by: https://github.com/ngimel ghstack dependencies: #163025, #163152 (cherry picked from commit 80f8be9840c20c3efe1274266b52ab098f4d1030) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-23 10:10:02 -07:00
David Berard	7cf37ae3cb	[2.9 cherry pick][triton] update 3.5 pin to bbb06c0334a6772b92d24bde54956e675c8c6604 (#163382 ) (#163583 ) Includes: * https://github.com/triton-lang/triton/pull/8211 to work around a PTXAS bug that was causing 03-matrix-multiplication tutorial matmuls to underperform due to excessive WGMMA waits * https://github.com/triton-lang/triton/pull/8157 to fix a convert_layout bug Verified that this passes Triton CI in https://github.com/pytorch/pytorch/pull/159158 and improves gemm perf (see https://github.com/pytorch/pytorch/issues/159704) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163382 Approved by: https://github.com/Camyll, https://github.com/atalman	2025-09-22 18:20:20 -07:00
Richard Zou	f83cf0714e	[graph partition] Add way to register custom rule (#163310 ) (#163395 ) This PR adds an experimental way to register a custom rule for if inductor should partition the graph around an operator. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/163310 Approved by: https://github.com/ProExpertProg, https://github.com/BoyuanFeng, https://github.com/eellison ghstack dependencies: #162117, #162307, #162651	2025-09-22 18:18:07 -07:00
pytorchbot	ddd5074afc	[CI] Update NVIDIA driver to `580.82.07` (#163522 ) [CI] Update NVIDIA driver to `580.82.07` (#163111) To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with `6e08c9d08e` This fix was suggested in https://github.com/pytorch/pytorch/issues/162878#issuecomment-3288635745 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163111 Approved by: https://github.com/huydhn (cherry picked from commit 8dbac62edb48815dfca84dfdcca40d6a24d0652b) Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>	2025-09-22 11:45:48 -04:00
pytorchbot	35c55da805	[Graph Partition] improve custom op output alias (#163380 ) [Graph Partition] improve custom op output alias (#163227) For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage. However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" /> Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output. `72fedf0575/torch/_inductor/ir.py (L7976-L7982)` According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of Fallback Kernel. Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227 Approved by: https://github.com/zou3519 (cherry picked from commit 4967ad8baa724b8b1acc123698bb1265723feb87) Co-authored-by: Boyuan Feng <boyuan@meta.com>	2025-09-19 16:36:03 -07:00
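The claim in the commit above, that deleting the container of a multi-output result does not invalidate the individual outputs, follows from ordinary Python reference semantics. The sketch below shows this with plain lists standing in for tensors. `op1_stand_in` is a hypothetical function and has nothing to do with the real custom op or with Inductor's generated code.

```python
def op1_stand_in():
    """Hypothetical multi-output op; lists play the role of tensors."""
    return ([1, 2, 3], [4, 5, 6])


buf1 = op1_stand_in()
buf3 = buf1[0]   # takes its own reference to the first output
buf4 = buf1[1]   # takes its own reference to the second output
del buf1         # only the tuple is released; its elements stay alive

result = (sum(buf3), sum(buf4))
```

After `del buf1`, both outputs remain usable because `buf3` and `buf4` each hold a reference to their data, which is why the early `del` is safe when no aliasing of inputs is involved.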
pytorchbot	a576d48637	Skip test_ind_worker_queue on Windows and macOS (flaky) (#163363 ) Skip test_ind_worker_queue on Windows and macOS (flaky) (#162555) Fixes https://github.com/pytorch/pytorch/issues/68643 It was closed by the bot yesterday and the issue was still there https://github.com/pytorch/pytorch/actions/runs/17595694816/job/49989589647. It's better to just skip it directly in the code as this test has been disabled on Windows and MacOS since 2021 O_o Pull Request resolved: https://github.com/pytorch/pytorch/pull/162555 Approved by: https://github.com/clee2000 (cherry picked from commit 98e22c8a693644c6d235d7a858dc411b1aefafa7) Co-authored-by: Huy Do <huydhn@gmail.com>	2025-09-19 13:07:00 -07:00
pytorchbot	25d8c0be68	Add decomp rule to assert_tensor_metadata for BatchedTensors (#163361 ) Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008) Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to guard against device specialization. This aten op didn't work with Vmap because we didn't register an explicit decomp rule saying that we just skip the BatchedTensor and call it on the underlying tensor Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008 Approved by: https://github.com/huydhn (cherry picked from commit e28983be76aa4651e3cb69dc3a4234d75038d938) Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>	2025-09-19 13:00:57 -07:00
Boyuan Feng	b1aae80953	[Cherry Pick][Graph Partition] allow sharing default device context (#163097 ) cherry pick PR 162873	2025-09-19 11:10:29 -07:00
eqy	76bebf38de	[Release 2.9] [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Met… (#163265 ) [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104) For https://github.com/pytorch/torchtitan/issues/1713 Also note that we will need to rollback the cuDNN frontend upgrade in 2.9 as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time `1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447%C2%A0)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104 Approved by: https://github.com/drisspg	2025-09-19 10:53:04 -07:00
pytorchbot	bc158ebdc7	[SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163262 ) [SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152) 1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong). 2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152 Approved by: https://github.com/ngimel ghstack dependencies: #163025 (cherry picked from commit 57a54a04b6eb78e0aa7d13b48e25fb8c0c49fd60) Co-authored-by: Ke Wen <kw2501@meta.com>	2025-09-19 10:51:02 -07:00
Camyll Harajli	ffa6f63fe2	Revert "Make distributed modules importable even when backend not bui… (#163024 ) Revert "Make distributed modules importable even when backend not built (#159889)" (#162568) This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9. Revert "Always build USE_DISTRIBUTED. (#160449)" This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568 Approved by: https://github.com/huydhn Co-authored-by: Edward Yang <ezyang@meta.com>	2025-09-19 10:34:55 -07:00
pytorchbot	baab5c6c8b	[ONNX] Update export docstring & Set fallback=False by default (#162637 ) * [ONNX] Update export docstring (#162622) Update export docstring to reflect the latest configuration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622 Approved by: https://github.com/titaiwangms (cherry picked from commit 7e2e83cdbe532b230dee40cfe0454116c9b64710) * Change fallback option to False in ONNX export * Change fallback parameter default to False --------- Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-09-16 17:23:47 -07:00
pytorchbot	9718af107e	Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162738 ) Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240) This makes gemma3 exportable on transformers=4.55.4. In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation. Since there have been a number of requests lately to add various operators in pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically: 1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied. 2) We apply the `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds the restriction that this custom autograd function needs to take in fx-able types. 3) The subclass constructor decorator is implemented similarly, so we just refactor it to use a similar implementation to this new decorator. Eventually we should delete the subclass constructor decorator. 4) Move some code in the subclass constructor decorator to exit early in non-export environments, which should shave off some inefficiency (around 1% according to @swolchok 's benchmark). Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758 Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240 Approved by: https://github.com/ydwu4 (cherry picked from commit 463fbc8ca0537e5635236190d2ca38ce6fcef831) Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>	2025-09-16 17:22:16 -07:00
pytorchbot	7f8ba48c2a	Fix the regression issue caused by non-aarch64 platforms not hitting the MKLDNN path. (#162778 ) Fix the regression issue caused by non-aarch64 platforms not hitting the MKLDNN path. (#162168) This issue was introduced by the commit in issue #161065. Added an extra check to provide a proper path for other platforms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162168 Approved by: https://github.com/mingfeima, https://github.com/malfet (cherry picked from commit 563921619b3e820b170475b9278ff94ee6e1a32c) Co-authored-by: Yuxingwang-intel <yuxing.wang@intel.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-16 17:21:10 -07:00
Cui, Yifeng	aebf427c53	[Release 2.9] Update torch-xpu-ops commit pin (#162935 ) Update commit pin to f8408a	2025-09-16 17:19:31 -07:00
pytorchbot	44baf2ff8d	fix deterministic scatter_add path for multi-d tensors (#162977 ) fix deterministic scatter_add path for multi-d tensors (#162866) Previously, `select` didn't work correctly for tensors with more than 2 dimensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162866 Approved by: https://github.com/valentinandrei (cherry picked from commit bf6b40da3e3be7718b8ddc94eed2da8cabaa5e86) Co-authored-by: Natalia Gimelshein <ngimel@meta.com>	2025-09-16 17:17:36 -07:00
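For readers unfamiliar with what `scatter_add` computes on a tensor with more than two dimensions, here is a pure-Python reference sketch of the documented rule for `dim=0` on 3-D data, `self[index[i][j][k]][j][k] += src[i][j][k]`. It uses nested lists and a hypothetical helper name, and it says nothing about how the fixed CUDA path is implemented. Visiting elements in one fixed loop order is what makes the accumulation reproducible.

```python
def scatter_add_dim0(dst, index, src):
    """Reference scatter_add along dim 0 for 3-D nested lists (hypothetical helper)."""
    for i, plane in enumerate(src):
        for j, row in enumerate(plane):
            for k, value in enumerate(row):
                # The index tensor only selects the position along dim 0.
                dst[index[i][j][k]][j][k] += value
    return dst

dst = [[[0, 0], [0, 0]] for _ in range(3)]      # shape (3, 2, 2)
src = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]  # shape (2, 2, 2)
index = [[[0, 2], [1, 1]], [[0, 2], [2, 1]]]    # same shape as src
out = scatter_add_dim0(dst, index, src)
print(out)
```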
pytorchbot	1076941ff7	[ONNX] Fix rotary_embedding_23 implementation (#163041 ) [ONNX] Fix rotary_embedding_23 implementation (#162865) The implementation of rotary_embedding_23 when input is 3D was incorrect. ## Tested Locally with ```py import onnx_ir as ir import onnx import torch import os import numpy as np base_path = "/home/justinchu/dev/onnx/onnx/backend/test/data/node" test_names = [ "test_rotary_embedding", "test_rotary_embedding_3d_input", "test_rotary_embedding_interleaved", "test_rotary_embedding_no_position_ids", "test_rotary_embedding_no_position_ids_interleaved", "test_rotary_embedding_no_position_ids_rotary_dim", "test_rotary_embedding_with_interleaved_rotary_dim", "test_rotary_embedding_with_rotary_dim", ] model_paths = [os.path.join(base_path, name) for name in test_names] for path in model_paths: print(f"Checking {path} for issues...") model = onnx.load(os.path.join(path, "model.onnx")) input0 = ir.from_proto( onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_0.pb")) ).numpy() input1 = ir.from_proto( onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_1.pb")) ).numpy() input2 = ir.from_proto( onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_2.pb")) ).numpy() if os.path.exists(os.path.join(path, "test_data_set_0", "input_3.pb")): input3 = ir.from_proto( onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_3.pb")) ).numpy() else: input3 = None output0 = ir.from_proto( onnx.load_tensor(os.path.join(path, "test_data_set_0", "output_0.pb")) ).numpy() m = ir.from_proto(model) node = m.graph[-1] print(node) assert node.op_type == "RotaryEmbedding" interleaved = node.attributes.get_int("interleaved", 0) num_heads = node.attributes.get_int("num_heads", 0) rotary_embedding_dim = node.attributes.get_int("rotary_embedding_dim", 0) torch_out = torch.onnx.ops.rotary_embedding( torch.tensor(input0), torch.tensor(input1), torch.tensor(input2), position_ids=torch.tensor(input3) if input3 is not None else None, 
interleaved=bool(interleaved), num_heads=num_heads, rotary_embedding_dim=rotary_embedding_dim, ) torch_out = torch_out.detach().cpu().numpy() np.testing.assert_allclose(torch_out, output0) ``` Fix https://github.com/pytorch/pytorch/issues/162848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162865 Approved by: https://github.com/kunal-vaishnavi, https://github.com/titaiwangms (cherry picked from commit fdf68fa5d70abebee1c5090a51ea30c7aa40b9b0) Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-09-16 17:16:23 -07:00
pytorchbot	0ac9fa4413	[ez][CI] Fix docs push in nightly workflow (#163085 ) [ez][CI] Fix docs push in nightly workflow (#162657) HUD metrics page says docs push hasn't happened in 21 days I guess main branch docs just haven't been updated? Did anyone notice? Do we care? Either way I think this should fix it Likely started after https://github.com/pytorch/pytorch/pull/161182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162657 Approved by: https://github.com/huydhn (cherry picked from commit 2f533959430c2a41fe16ef79fe4d680a5c4e0585) Co-authored-by: Catherine Lee <csl@fb.com>	2025-09-16 12:04:17 -07:00
pytorchbot	152383b745	fix typo: summit -> submit (#162597 ) fix typo: summit -> submit (#162587) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162587 Approved by: https://github.com/justinchuby (cherry picked from commit fefc406a3d0d90db0f808419fb88045f90b213cd) Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>	2025-09-12 11:41:11 -04:00
pytorchbot	c31a8186c1	[CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162596 ) [CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162566) Fixes aarch64 linux packaging, following error: https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62 ``` Traceback (most recent call last): File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module> import torch File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module> from torch._C import * # noqa: F403 ^^^^^^^^^^^^^^^^^^^^^^ ImportError: libarm_compute.so: cannot open shared object file: No such file or directory ``` Due to missing dependencies. Current Error: File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl Hence the repackaging does not take any effect. This PR does following File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl deleted File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl Looks like after migrating from zipping the wheel to wheel pack renaming the wheel is no longer necessary. Hence removing renaming and deleting old file. 
``` 2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling 2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1 2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5 2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so 2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so 2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0 2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0 2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0 2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0 2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64 2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64 2025-09-10T10:10:14.1503682Z 2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK 2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl ``` Test Plan, Executed on local file: ``` inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD Bundling CUDA libraries with wheel Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64 to Tag: cp310-cp310-manylinux_2_28_aarch64 Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts Build 
Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl.. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566 Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug (cherry picked from commit 3d32bb114bf0d5bd0193dc40f20253635dddf080) Co-authored-by: atalman <atalman@fb.com>	2025-09-10 12:22:02 -04:00
pytorchbot	ce928e17c1	CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162501 ) CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425) Related to https://github.com/pytorch/pytorch/issues/162333 https://github.com/pytorch/pytorch/issues/159779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162425 Approved by: https://github.com/tinglvv, https://github.com/malfet (cherry picked from commit e38e953432764e00f16999c8b7df6346ad357a16) Co-authored-by: atalman <atalman@fb.com>	2025-09-09 14:27:57 -04:00
Andrey Talman	cd2c98a5b5	[Release 2.9] Release only changes (#162493 )	2025-09-09 11:15:20 -07:00
Huy Do	4840a1a591	Run vLLM tests on all trunk commits before 2.9 branch cut (#161797 ) This makes it easier to bisect issue now given that we don't have lots of time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161797 Approved by: https://github.com/yangw-dev	2025-09-09 05:56:41 +00:00
Yang Wang	d49205fe1f	Add more tests for vllm and clean out the old vllm test (#162292 ) Test failure coverage from pytorch 2.8 release issues [internal access only](https://docs.google.com/document/d/1zvK1eUAHubHGGHg9jKxd-QlP89fzgfqOBvE2m9mUs90/edit?tab=t.0 ) See coverage mapping
| Given test / pattern | Suite ID (from config) |
|---|---|
| pytest -v -s basic_correctness/test_cumem.py | vllm_basic_correctness_test |
| pytest -v -s entrypoints/openai/test_sleep.py | vllm_entrypoints_test |
| pytest -v -s entrypoints/openai/test_translation_validation.py::test_long_audio_request | vllm_entrypoints_test |
| pytest -v -s lora/test_quant_model.py | vllm_lora_28_failure_test |
| pytest -v -s -x tests/lora/test_llama_tp.py | vllm_lora_tp_test_distributed |
| pytest -v -s distributed/test_sequence_parallel.py -k test_tp_sp_generation | vllm_distributed_test_28_failure_test |
| pytest -v -s distributed/test_sequence_parallel.py::test_tp_sp_generation[...] | vllm_distributed_test_28_failure_test |
| pytest models/language/generation/test_mistral.py::test_models[...] | vllm_languagde_model_test_extended_generation_28_failure_test |
| pytest models/multimodal/pooling/test_jinavl_reranker.py::test_model_text_image[...] | vllm_multi_model_test_28_failure_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen25vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora_beam_search | vllm_lora_test |
| tests/lora/test_phi.py::test_phi2_lora | DIDN'T FIND IT IN VLLM |
| models/multimodal/generation/test_voxtral.py::test_models_with_multiple_audios[5-128-half] | vllm_multi_model_test_28_failure_test |
| models/test_initialization.py::test_can_initialize[VoxtralForConditionalGeneration] | vllm_basic_models_test |
| pytest -v -s -x lora/test_chatglm3_tp.py -k test_chatglm3_lora_tp4_fully_sharded_loras | vllm_lora_tp_test_distributed |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162292 Approved by: https://github.com/atalman, https://github.com/huydhn	2025-09-09 05:53:46 +00:00
James Wu	d85392a88e	Add BundledAOTAutogradSerializableCallable (#162170 ) This PR hooks up the python wrapper inductor backend to aot_compile. This is not the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now. In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170 Approved by: https://github.com/zhxchen17 ghstack dependencies: #162169	2025-09-09 05:42:19 +00:00
Chien-Chin Huang	7feb8fc589	[SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not available (#162142 ) Summary: As we have multiple backends, _SymmetricMemory should not be imported together with NVSHMEM related modules Pull Request resolved: https://github.com/pytorch/pytorch/pull/162142 Approved by: https://github.com/dcci, https://github.com/kwen2501	2025-09-09 05:37:43 +00:00
PyTorch MergeBot	60d009267e	Revert "testing infra and some fixes (#162183 )" This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d. Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))	2025-09-09 05:26:32 +00:00
Isuru Fernando	4590438329	[fx] fix qualified name for methods of torch.Tensor (#162407 ) This fixes an error in the previous PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407 Approved by: https://github.com/ezyang, https://github.com/XuehaiPan	2025-09-09 05:14:43 +00:00
Jeffro	8494afb837	Add missing fstream include to fix std::ofstream compilation error (#162421 ) ## Summary This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard Google internal compile setup (built with bazel). ## Details The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers. ## Error message: ``` torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>' 8 \| std::ofstream file(filename_, std::ios::binary); \| ^ libcxx/include/__fwd/fstream.h:26:7: note: template is declared here 26 \| class basic_ofstream; \| ^ 1 error generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421 Approved by: https://github.com/ezyang	2025-09-09 05:14:32 +00:00
PyTorch UpdateBot	7ad40de60e	[audio hash update] update the pinned audio hash (#162437 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162437 Approved by: https://github.com/pytorchbot	2025-09-09 04:41:34 +00:00
PyTorch UpdateBot	607327beae	[vllm hash update] update the pinned vllm hash (#162356 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162356 Approved by: https://github.com/pytorchbot	2025-09-09 04:40:25 +00:00
Ke Wen	f216d64bfe	[SymmMem] Better tuning of A2AV based on accurate node boundary (#162003 ) Use `world_within_direct_access()` to distinguish intra- vs inter- node. Previously we assumed a fixed node size of 8, which is not true for NVL72. Also added env var `TORCH_SYMMMEM_NBLOCKS` for control. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162003 Approved by: https://github.com/ngimel, https://github.com/fduwjj	2025-09-09 04:18:17 +00:00
Nikita Shulga	847d7f21af	[CUDA-13] Implement workaround for cudaErrorNotSupported (#162412 ) See https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162412 Approved by: https://github.com/eqy, https://github.com/atalman	2025-09-09 04:12:10 +00:00
Ke Wen	065c446193	[SymmMem] Use global pe for put and get (#162394 ) NVSHMEM put/get APIs take global PE instead of local counterpart. So we'd need to do a translation within the kernel. Also added a sub-group test for dispatch and combine mimic'ing the Expert Parallel cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394 Approved by: https://github.com/ngimel, https://github.com/fegin ghstack dependencies: #162320	2025-09-09 03:58:48 +00:00
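The local-to-global translation mentioned above can be illustrated on the host side. This is a minimal sketch, assuming the sub-group is described by an explicit list of global ranks. The helper name `to_global_pe` is hypothetical and is neither the in-kernel code from the PR nor an NVSHMEM API.

```python
def to_global_pe(group_ranks, local_pe):
    """Map a rank within a sub-group to its global PE (hypothetical helper)."""
    if not 0 <= local_pe < len(group_ranks):
        raise IndexError(f"local PE {local_pe} is outside the group")
    return group_ranks[local_pe]

# A sub-group made of every other rank in a world of 8.
group_ranks = [1, 3, 5, 7]
targets = [to_global_pe(group_ranks, pe) for pe in range(len(group_ranks))]
print(targets)
```

A put or get addressed to local rank 2 of this group would therefore have to target global PE 5.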
Ke Wen	98ecc0f374	[SymmMem] Add team pool to hold duplicated teams for the same rank group (#162320 ) When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires each call being made on a separate team struct, see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html?highlight=nvshmem_barrier_all#collective-operations-scopes-and-active-sets). This PR adds a util `get_n_teams` for creating duplicated nvshmem teams for the same rank group, i.e. team pool. So that we can use them on device side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162320 Approved by: https://github.com/ngimel	2025-09-09 03:58:48 +00:00
Arsh Zahed	4c45090cf7	[DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798 ) Fixes #159590 This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)). This also handles cases such as #159590 where dynamo is disabled during tracing by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these, and the list/tuple inputs with the cat op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798 Approved by: https://github.com/bdhirsh	2025-09-09 03:52:05 +00:00
PyTorch MergeBot	1641606aa4	Revert "Add BundledAOTAutogradSerializableCallable (#162170 )" This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a. Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))	2025-09-09 03:33:36 +00:00
Shunting Zhang	7b8a64557d	[inductor] fix 3d tiled online softmax (#162341 ) The online_softmax_reduce runtime helper previously assumes the input tl.Tensor's are 2d tensors. But with tiled reduction, they can be 3d (y, x, r). Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162311	2025-09-09 02:59:52 +00:00
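The reduction behind online softmax merges partial (running max, running sum) pairs, rescaling each sum to the shared maximum. The following is a scalar Python sketch of that combine step, assuming plain floats and hypothetical function names. It is not the Triton `online_softmax_reduce` helper and does not model the 2-D versus 3-D tile shapes the fix is about.

```python
import math

def combine(a, b):
    """Merge two partial (running_max, running_sum) pairs."""
    m_a, s_a = a
    m_b, s_b = b
    m = max(m_a, m_b)
    # Rescale each partial sum to the shared maximum before adding.
    return m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m)

def online_softmax_stats(values, chunk=2):
    """Fold a sequence chunk by chunk into (max, sum of exp(v - max))."""
    state = (-math.inf, 0.0)
    for start in range(0, len(values), chunk):
        part = values[start:start + chunk]
        m = max(part)
        s = sum(math.exp(v - m) for v in part)
        state = combine(state, (m, s))
    return state

values = [1.0, 3.0, 2.0, 5.0, 4.0]
running_max, running_sum = online_softmax_stats(values)
print(running_max, running_sum)
```

Because `combine` is associative, the same pair comes out regardless of how the input is chunked, which is what lets a tiled kernel reduce along the `r` axis in pieces.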
Tugsbayasgalan Manlaibaatar	d8b6622bb6	testing infra and some fixes (#162183 ) This PR is quite large in that it covers most of rough edges in the new strict export flow: 1. Handle nn_module_stack correctly now that we are tracing wrapper module 2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore. 3. Correct input and output handling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183 Approved by: https://github.com/zhxchen17 ghstack dependencies: #162167	2025-09-09 02:42:11 +00:00
Yiming Zhou	a965f09793	[export] Update PT2 archive docs (#162308 ) Summary: Minor updates based on the recent refactoring for weight saving and loading Test Plan: doc change only Rollback Plan: Differential Revision: D81821994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162308 Approved by: https://github.com/angelayi	2025-09-09 02:08:13 +00:00
Kurt Mohler	583bbf7761	[MPS] Add `native_dropout` and `native_dropout_backward` (#162108 ) Fixes #162002 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108 Approved by: https://github.com/malfet	2025-09-09 01:44:06 +00:00
Scott Wolchok	e025c0f459	Dynamo: set_eval_frame microoptimization (#162220 ) Optimize for common case and remove a pair of refcount operations (see new comments.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162220 Approved by: https://github.com/jansel, https://github.com/williamwen42 ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219	2025-09-09 01:10:06 +00:00
Scott Wolchok	a8a187b2cf	Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef (#162219 ) Avoids requiring vector allocation to call this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162219 Approved by: https://github.com/Skylion007 ghstack dependencies: #161591, #161595, #161633, #161634, #161692	2025-09-09 01:10:06 +00:00
Scott Wolchok	12db2a7889	Call checkLong in is_int_or_symint, completing TODO (#161692 ) Calling this first minimizes overhead for plain old ints, making cheap things cheap. Differential Revision: [D81530098](https://our.internmc.facebook.com/intern/diff/D81530098) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161692 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #161591, #161595, #161633, #161634	2025-09-09 01:10:06 +00:00
Scott Wolchok	eab2afeff7	fastpath type Tensor in THPVariable_NewWithVar (#161634 ) It is cheap to do an exact check against Tensor and much faster when it works (PyType_IsSubtype does not have this fastpath, I checked [source](`9ee0214b5d/Objects/typeobject.c (L2889)`)). Spot-checked in perf on detach-DTensor-in-a-loop benchmark; small win but clear. Differential Revision: [D81530101](https://our.internmc.facebook.com/intern/diff/D81530101) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161634 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #161591, #161595, #161633	2025-09-09 01:10:06 +00:00
Scott Wolchok	a951f435fd	Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function (#161633 ) py::args::size() calls PyTuple_GetSize. Compiler can't know the two calls will always return the same result, so we have to consolidate them ourselves. Differential Revision: [D81530096](https://our.internmc.facebook.com/intern/diff/D81530096) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161633 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #161591, #161595	2025-09-09 01:10:06 +00:00
karthickai	6eb14ac60f	[Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447 ) This PR fixes bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior. Fixes: #140457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447 Approved by: https://github.com/mlazos	2025-09-09 01:07:02 +00:00
PyTorch MergeBot	ed77e23b68	Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482 )" This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7. Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))	2025-09-09 00:21:05 +00:00
Ting Lu	897c4e70a7	Move to small wheel approach for CUDA SBSA wheel (#160720 ) https://github.com/pytorch/pytorch/issues/160673 Use download.pytorch.org's dependencies like x86 build instead of bundling libs into the wheel Pull Request resolved: https://github.com/pytorch/pytorch/pull/160720 Approved by: https://github.com/atalman	2025-09-09 00:18:43 +00:00
Zhengxu Chen	8485aac873	[precompile] Fix inlined source tracking with generators. (#162389 ) Summary: When compiled code has generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location. In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field. Test Plan: test_package.py -k test_code_with_generator Rollback Plan: Differential Revision: D81929751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389 Approved by: https://github.com/dolpm, https://github.com/hrithick-codes	2025-09-09 00:13:54 +00:00
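The mismatch described above comes from the fact that a generator has its own code object whose `co_firstlineno` points at the inner location, not at the enclosing function. A small sketch, assuming a generator expression inside a function compiled from a string so the line numbers are fixed (the source text and names are made up for illustration and are unrelated to `test_package.py`):

```python
import types

SRC = (
    "def outer():\n"
    "    x = 1\n"
    "    return (i + x for i in range(3))\n"
)

module_code = compile(SRC, "<demo>", "exec")

def child_code_objects(code):
    # Nested functions and generator expressions live in co_consts.
    return [c for c in code.co_consts if isinstance(c, types.CodeType)]

outer_code = child_code_objects(module_code)[0]
genexpr_code = child_code_objects(outer_code)[0]

print(outer_code.co_name, outer_code.co_firstlineno)
print(genexpr_code.co_name, genexpr_code.co_firstlineno)
```

Source tracking keyed on the inner `co_firstlineno` would point into the middle of `outer`, so using the top-level enclosing location avoids the inconsistency.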
atalman	c0fc86b511	Fix aarch64 wheel pack (#159481 ) PR that introduced the change: https://github.com/pytorch/builder/pull/1775 Use wheel pack instead of zip to repack the wheel. It should regenerate the RECORD file and update all the hashes correctly. TODO: (1) Apply wheel pack instead of zip to the rest of the builds. (2) Add a validation test to make sure the wheel contents match the RECORD file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159481 Approved by: https://github.com/malfet	2025-09-08 23:36:50 +00:00
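To see why repacking with plain zip leaves RECORD stale, it helps to look at what one RECORD entry contains. In the wheel format, each entry is `path,sha256=<digest>,size`, with the digest encoded as URL-safe base64 without padding. The sketch below builds one such line. The helper name and the file path are hypothetical, and this is not the code that `wheel pack` runs.

```python
import base64
import hashlib

def record_entry(path, data):
    """Build one RECORD line: path,sha256=<urlsafe-b64 digest, no padding>,size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"

entry = record_entry("torch/version.py", b"__version__ = '0.0.0'\n")
empty = record_entry("torch/empty.py", b"")
print(entry)
print(empty)
```

Any file added or changed after the wheel was built, such as a bundled shared library, needs a fresh entry like this, and regenerating these entries is the step the commit relies on `wheel pack` to perform.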
Thomas Bohnstingl	07f07309c6	[associative_scan] Autograd separated (#139939 ) This PR implements the Autograd feature of the associative_scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939 Approved by: https://github.com/huydhn	2025-09-08 23:30:11 +00:00
Laith Sakka	189a054cfb	Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869 ) [relanding again after fixing internal build] Summary: This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(), but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate. I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context: in https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function computing contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE. When people call is_contiguous we do sym_is_contiguous().guard_bool(). When people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false(). One issue not handled well was this path: ``` c10::SymBool TensorImpl::sym_is_contiguous_custom( at::MemoryFormat memory_format) const { if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) { return pyobj_slot_.load_pyobj_interpreter()->is_contiguous( this, memory_format); } return sym_is_contiguous_default(memory_format); } ``` Namely, if we call sym_is_contiguous_custom but matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format); this used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning. Once we removed that implicit size-oblivious reasoning, the right thing to do is to call return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format); otherwise we would get a DDE even if the caller is doing sym_is_contiguous. So I had to define it for pyinterpreter, and then I had to override it for nested tensors. 
Approved by: https://github.com/ezyang Test Plan: contbuild & OSS CI, see `e444cd24d4` Rollback Plan: Differential Revision: D80435179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869 Approved by: https://github.com/ezyang	2025-09-08 22:59:13 +00:00
Colin Peppler	5fd6b6a2db	[refactor] add helper sizevars function, is_size_one, for size==1 checks (#162189 ) ## Summary - document guard behavior in `SizeVarAllocator.is_size_one` - use `is_size_one` for broadcast/expand checks. - This diff is a no-op since we'd use `shape_env.evaluate_expr(... fallback_value=False)` `a4f9132a17/torch/_inductor/sizevars.py (L450-L453)` ------ https://chatgpt.com/codex/tasks/task_e_68b8d0d1f2c48328b2d38c00e738bc8c Pull Request resolved: https://github.com/pytorch/pytorch/pull/162189 Approved by: https://github.com/laithsakka	2025-09-08 22:48:16 +00:00
drisspg	ac9ccd0dc2	Add return-max-scores to flex-attention (#161667 ) # Summary ### Update API ```Py class AuxRequest(NamedTuple): """Request which auxiliary outputs to compute from flex_attention. Each field is a boolean indicating whether that auxiliary output should be computed. """ lse: bool = False max_scores: bool = False class AuxOutput(NamedTuple): """Auxiliary outputs from flex_attention operation. Fields will be None if not requested, or contain the tensor if requested. """ lse: Optional[Tensor] = None max_scores: Optional[Tensor] = None out_only = flex_attention(query, key, value, score_mod) out_max, aux_max = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(max_scores=True), ) out_both, aux_both = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True), ) ``` Returns the max post-mod scores from flex attention. Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups. Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added it as kwarg-only for now. Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args. We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors. ### Req Grad I currently don't return a max_scores that supports backpropagating grads. 
I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward or find the max_score and apply the gradient only to the first occurrence if there are multiple equal scores (need to check whether that is what we define for the vanilla max op in torch). For now no grad; we can revisit if needed. ## Perf I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return None or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path. ```Shell 🔝 Top 5 TFlops Deltas (by absolute %): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ 🔺 Top 5 Positive TFlops Deltas (highest +%): shape: (5, 7) 
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘ 🔻 Top 5 Negative TFlops Deltas (lowest -%): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ 
sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │ │ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng	2025-09-08 22:44:48 +00:00
Avik Chaudhuri	711c8c821e	shape guards (#161178 ) Summary: This PR introduces shape guards to export. Previously only value ranges, equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program. Test Plan: updated several tests Rollback Plan: Differential Revision: D80713603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178 Approved by: https://github.com/tugsbayasgalan	2025-09-08 22:44:09 +00:00
Laith Sakka	2c538c9acf	rewrite __maybe_broadcast should_expand check for unbacked (#162109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162109 Approved by: https://github.com/aorenste ghstack dependencies: #162084, #162099	2025-09-08 22:41:18 +00:00
Laith Sakka	85fe94e933	make should_swap more dde friendly (#162099 ) Unblock customers for common cases with DDE until @pianpwk lands the change to should_swap in https://github.com/pytorch/pytorch/pull/160473. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162099 Approved by: https://github.com/aorenste ghstack dependencies: #162084	2025-09-08 22:41:18 +00:00
Hao Wu	fecd9686f5	Graph split event tracker (#159795 ) Summary: A tool to track events in graph split, specifically how nodes end up in acc or cpu subgraphs. Usage: use env vars to specify a mode and the necessary arguments. FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode. ``` Different modes of the event tracker: "0": Tracker not enabled (by default) "1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspecting in pdb. "2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt "3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recursively and dump to DUMP_PREFIX_nodex.txt "4": In addition to events dump, track all nodes with more than 1 event recursively and dump to DUMP_PREFIX_nodex.txt ``` FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`. FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3". Test Plan: New unit test Reviewed By: georgiaphillips Differential Revision: D79203595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795 Approved by: https://github.com/ezyang	2025-09-08 21:30:17 +00:00
PyTorch MergeBot	dd44faa9d9	Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#162103 )" This reverts commit 0af70e2353e1dcda83175fd4834ecb7b63e009e0. Reverted https://github.com/pytorch/pytorch/pull/162103 on behalf of https://github.com/jithunnair-amd due to Cirrascale network outage resolved. Reverting back to running per commit to aid in triage and CI health ([comment](https://github.com/pytorch/pytorch/pull/162103#issuecomment-3267977825))	2025-09-08 20:53:05 +00:00
PyTorch MergeBot	5d819f3faf	Revert "[associative_scan] Autograd separated (#139939 )" This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8. Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))	2025-09-08 20:42:47 +00:00
Nikita Shulga	015423bef8	Add fp16-overflow regression test (#162401 ) Discovered while debugging https://github.com/pytorch/pytorch/issues/160841 where sdpa returned NaNs, because during the computation intermediate values were cast back to fp16 before normalization, which was fixed by https://github.com/pytorch/pytorch/pull/161999 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-09-08 20:33:23 +00:00
William Wen	26a1b9cce2	[dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318 ) Fixes https://github.com/pytorch/pytorch/issues/162313 Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305	2025-09-08 20:26:24 +00:00
Benjamin Glass	8f114650eb	Add std::any_of to ConvParams struct (#162334 ) Removes some for-loops that didn't short-circuit in favor of std::any_of. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162334 Approved by: https://github.com/Skylion007	2025-09-08 20:12:20 +00:00
Aaron Gokaslan	ec2c1371af	[BE]: Update cudnn frontend submodule to 1.14.1 (#162347 ) Fixes a few bugs introduced to CUDNN 1.11 which affects all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header only update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347 Approved by: https://github.com/eqy, https://github.com/atalman	2025-09-08 20:03:23 +00:00
IvanKobzarev	8ec01f34e9	[bucketing] custom_ops mode to hide inductor copies overhead (#161499 ) Adds "_custom_ops" bucketing to temporarily fall back to eager execution of for_each, to work around too many generated kernels on the inductor side. This PR also reverts parts of the bucketing changes for cycle detection that resulted in accuracy problems. Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499 Approved by: https://github.com/eellison	2025-09-08 20:03:08 +00:00
Ting Lu	9c991b63ff	[CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 build (#162364 ) https://github.com/pytorch/pytorch/issues/159779 Add the full CUDA support matrix to sbsa build (12.6, 12.8) Same arch support as x86 build Remove 12.9 sbsa build Pull Request resolved: https://github.com/pytorch/pytorch/pull/162364 Approved by: https://github.com/atalman	2025-09-08 20:00:25 +00:00
rzou	4e50651c5f	[DTensor] fix F.one_hot (#162307 ) F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due to an arange call creating a new Tensor (not DTensor). This PR fixes it by allowing implicit replication of Tensors for the arange call and the one consumer of the arange call (the at::eq call). Test Plan: - new test. Also, F.one_hot(num_classes=-1) is broken so we skip that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307 Approved by: https://github.com/ezyang ghstack dependencies: #162117	2025-09-08 19:37:08 +00:00
Edward Z. Yang	a0d026688c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-08 19:10:36 +00:00
Edward Yang	d80297a684	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-08 19:10:36 +00:00
angelayi	fbcabb4fbd	Handle f([]) vs. f() in fake tensor caching (#162284 ) Fixes https://github.com/pytorch/pytorch/issues/162279 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162284 Approved by: https://github.com/manuelcandales, https://github.com/aorenste	2025-09-08 18:28:05 +00:00
PyTorch UpdateBot	314d47a210	[audio hash update] update the pinned audio hash (#162315 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315 Approved by: https://github.com/pytorchbot	2025-09-08 18:26:33 +00:00
atalman	bc4176c92a	CD Windows CUDA 13.0 build - fix packaging of cuda dlls (#162383 ) Trying to fix https://github.com/pytorch/pytorch/issues/162333 The CUDA 13.0 file structure changed. Instead of keeping most of the dlls in the bin folder, they are now in ``bin\x64``, except for the cudnn dll. See attached picture : <img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" /> <img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162383 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet	2025-09-08 17:57:22 +00:00
eqy	de5dc1f038	[cuDNN][SDPA][Nested Tensor] add forward/backward caching support for cuDNN SDPA Nested tensor/varlen (#161434 ) Don't recompile every time Pull Request resolved: https://github.com/pytorch/pytorch/pull/161434 Approved by: https://github.com/drisspg	2025-09-08 17:51:13 +00:00
morrison-turnansky	72e6717d00	Avoid crash with release_available_cached_blocks (#162269 ) updated release behavior for cached blocks Fixes #159567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162269 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-09-08 17:46:43 +00:00
Shunting Zhang	ebd29a13fe	[inductor] fuse for scalar shared data (#162311 ) LOAF could previously skip these fusion opportunities and cause some tests to fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311 Approved by: https://github.com/jansel	2025-09-08 17:20:46 +00:00
fengqing.lu	5793dd7875	[Intel GPU] Integrate OneDNN SDPA training forward and backward (#161058 ) This PR is the first split PR of https://github.com/pytorch/pytorch/pull/156272, only contains the OneDNN code. Please help review. Pending on OneDNN v3.9 commit update. Don't merge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161058 Approved by: https://github.com/guangyey, https://github.com/EikanWang	2025-09-08 17:07:31 +00:00
Scott Wolchok	49c446c617	Add C++ function for torch.distributed.tensor._op_schema.is_view_op (#161595 ) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: #161466, #161586, #161590, #161591	2025-09-08 16:28:08 +00:00
Scott Wolchok	8e076d889c	Don't call check_has_torch_dispatch in THPVariable_NewWithVar if we already know (#161591 ) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161591 Approved by: https://github.com/ezyang ghstack dependencies: #161466, #161586, #161590	2025-09-08 16:28:08 +00:00
Chien-Chin Huang	f044fa2902	[AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041 ) The async tp result and regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error is from numerical error of low precision. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel ghstack dependencies: #162040	2025-09-08 16:12:52 +00:00
PyTorch MergeBot	a92773eeb1	Revert "Use vectorized stores for all dtypes in cat (#161649 )" This reverts commit 377033757ae5ca524ea842f1b0a5f446ed3d8fe0. Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3266963044))	2025-09-08 15:58:58 +00:00
PyTorch MergeBot	53297f6ad0	Revert "[audio hash update] update the pinned audio hash (#162315 )" This reverts commit c9ac8c25ef9ad020542898ab569910a9d0cd1f7e. Reverted https://github.com/pytorch/pytorch/pull/162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](https://github.com/pytorch/pytorch/pull/162315#issuecomment-3266932718))	2025-09-08 15:52:30 +00:00
IvanKobzarev	25c170b72e	[inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (#161405 ) During comms reordering and sink wait iterative, we observed that previous runtime estimations were pretty off for collectives and mms. Adding optional usage of: - c10d.time_estimator for collectives, which is based on the NCCL estimator - Benchmark mode only for matmuls, as they are highly dependent on the mm backend. The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp https://github.com/pytorch/pytorch/pull/157572 These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()` Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405 Approved by: https://github.com/eellison	2025-09-08 14:33:19 +00:00
David Berard	3f5993316e	[upstream triton] update triton pin to triton 3.5 (#162278 ) Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton) Notably: * this does not include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR) * sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278 Approved by: https://github.com/atalman ghstack dependencies: #162244, #162309	2025-09-08 14:29:24 +00:00
PyTorch UpdateBot	e101411b9f	Update slow tests (#161395 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161395 Approved by: https://github.com/pytorchbot	2025-09-08 13:33:32 +00:00
PyTorch UpdateBot	32911ff541	[xla hash update] update the pinned xla hash (#162372 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162372 Approved by: https://github.com/pytorchbot	2025-09-08 11:31:16 +00:00
Chien-Chin Huang	5b90e85112	[AsyncTP] Fixes AsyncMM (#162040 ) The original implementation set beta to 1, which caused the out (C) to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect. Removing the alpha and beta fixes the issue. Thanks to @ngimel for figuring out the root cause. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040 Approved by: https://github.com/danielvegamyhre	2025-09-08 10:53:59 +00:00
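The failure mode described in #162040 can be illustrated with a minimal sketch, assuming the usual GEMM convention `out = alpha * (A @ B) + beta * C`. The helpers `matmul` and `gemm` below are hypothetical pure-Python stand-ins written for this illustration; they are not the AsyncMM kernel or any PyTorch API.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def gemm(a, b, c, alpha=1.0, beta=0.0):
    """Hypothetical GEMM: returns alpha * (a @ b) + beta * c."""
    prod = matmul(a, b)
    return [
        [alpha * prod[i][j] + beta * c[i][j] for j in range(len(c[0]))]
        for i in range(len(c))
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
# Stands in for an output buffer that was never zero-initialized.
stale_out = [[100.0, 100.0], [100.0, 100.0]]

expected = matmul(a, b)
with_beta_one = gemm(a, b, stale_out, beta=1.0)   # stale values leak into the result
with_beta_zero = gemm(a, b, stale_out, beta=0.0)  # buffer contents are ignored
```

With `beta=1.0` every entry is offset by whatever the buffer held, which matches the commit's observation that the output is only correct when the buffer happens to be zero.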
David Berard	31d5c67539	[inductor][triton] support static cuda launcher after triton # 7866 (#162309 ) Fixes static cuda launcher after https://github.com/triton-lang/triton/pull/7866. Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162309 Approved by: https://github.com/aakhundov ghstack dependencies: #162244	2025-09-08 07:57:48 +00:00
David Berard	fb0afa853e	[inductor][triton] more JITCallable._hash_lock support (#162244 ) Follow-up to #161768. Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished. #161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()`, which is not pickleable. It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case: the MultiKernel case uses the hash lock. To do this, we add `restore_after_unpickle()`, which will restore fields (or, if the old fields are not provided, initialize just the hash_lock). Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162244 Approved by: https://github.com/atalman	2025-09-08 07:57:48 +00:00
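The strip-then-restore pattern described in #162244 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the inductor code: the kernel is modeled as a `types.SimpleNamespace`, and the two helpers only borrow the names `prepare_for_pickle` and `restore_after_unpickle` from the commit message; their signatures here are hypothetical.

```python
import pickle
import threading
import types


def prepare_for_pickle(kernel):
    """Strip the un-picklable lock and return it so it can be restored later."""
    old_lock = kernel._hash_lock
    kernel._hash_lock = None
    return old_lock


def restore_after_unpickle(kernel, old_lock=None):
    """Put the old lock back, or create a fresh one if none is provided."""
    kernel._hash_lock = old_lock if old_lock is not None else threading.RLock()


# Hypothetical stand-in for a JIT callable that owns a threading.RLock.
kernel = types.SimpleNamespace(name="kernel0", _hash_lock=threading.RLock())

# Pickling fails while the lock is attached.
try:
    pickle.dumps(kernel)
    failed_with_lock = False
except TypeError:
    failed_with_lock = True

# Strip, round-trip through pickle, then restore on both sides.
saved = prepare_for_pickle(kernel)
clone = pickle.loads(pickle.dumps(kernel))
restore_after_unpickle(kernel, saved)  # original gets its own lock back
restore_after_unpickle(clone)          # the copy gets a freshly created lock

with clone._hash_lock:
    clone_lock_usable = True
```

The second call mirrors the fallback mentioned in the commit: when the old fields are not available on the receiving side, only the lock is re-initialized.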
PyTorch MergeBot	1e0656f063	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit de893e96c775023aa3be895060848fac3296772c. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))	2025-09-08 07:04:36 +00:00
PyTorch MergeBot	29e09a6545	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))	2025-09-08 07:04:36 +00:00
PyTorch UpdateBot	c9ac8c25ef	[audio hash update] update the pinned audio hash (#162315 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315 Approved by: https://github.com/pytorchbot	2025-09-08 04:17:23 +00:00
Thomas Bohnstingl	103f725afa	[associative_scan] Autograd separated (#139939 ) This PR implements the Autograd feature of the associative_scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939 Approved by: https://github.com/ydwu4	2025-09-08 03:21:17 +00:00
James Wu	5babb4d5c0	Add BundledAOTAutogradSerializableCallable (#162170 ) This PR hooks up the python wrapper inductor backend to aot_compile. This is not the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now. In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170 Approved by: https://github.com/zhxchen17 ghstack dependencies: #162169	2025-09-07 23:37:31 +00:00
James Wu	eb9073a6b7	[easy] [precompile] Convert CompileArtifacts to callable (#162169 ) The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module. Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162169 Approved by: https://github.com/zhxchen17	2025-09-07 23:37:31 +00:00
Yidi Wu	ec2e3687c7	[while_loop][autograd] support autograd_key of while_loop (#160483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483 Approved by: https://github.com/zou3519	2025-09-07 21:55:29 +00:00
PyTorch MergeBot	ff2de5d522	Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473 )" This reverts commit 040d00af048967dde7938d358d7f5988cbd18388. Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983))	2025-09-07 21:06:38 +00:00
PyTorch MergeBot	8235c4f65d	Revert "[ROCm] Enabling several UTs (#161715 )" This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354. Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))	2025-09-07 21:03:17 +00:00
PyTorch MergeBot	e246a85b76	Revert "[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118 )" This reverts commit 5c473e9f5ee0ef0fc38e6cf34a95b547f8cdc8d5. Reverted https://github.com/pytorch/pytorch/pull/159118 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473 ([comment](https://github.com/pytorch/pytorch/pull/159118#issuecomment-3264037799))	2025-09-07 21:00:29 +00:00
PyTorch MergeBot	df59c21768	Revert "[BE] Cleanup stale comments/copy from `gemm` (#162001 )" This reverts commit 6087ef41e54c2494b117ffd923faf20f515a6806. Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to breaks internal ads signal, see [D81845017](https://www.internalfb.com/diff/D81845017) ([comment](https://github.com/pytorch/pytorch/pull/162001#issuecomment-3264034312))	2025-09-07 20:53:16 +00:00
PyTorch MergeBot	093ab5f477	Revert "[inductor] add kernel template choice (ktc) (#161347 )" This reverts commit 9a8d454c464c0b811fc4586ff104424bccf1da0c. Reverted https://github.com/pytorch/pytorch/pull/161347 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))	2025-09-07 20:39:39 +00:00
PyTorch MergeBot	4348db0b92	Revert "[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348 )" This reverts commit c32111149921b48bfef909293f1049e21619ed76. Reverted https://github.com/pytorch/pytorch/pull/161348 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))	2025-09-07 20:39:39 +00:00
Vinayak Pawar	9ad5e8edb1	Improve typing of ONNX decorators with ParamSpec (#162332 ) ## Summary This PR improves typing in ONNX-related modules by replacing TypeVar bound to Callable[..., Any] with ParamSpec to preserve parameter types and avoid type erasure in decorator functions. ## Changes - `torch/onnx/_internal/exporter/_flags.py`: Replace TCallable TypeVar with ParamSpec - `torch/onnx/ops/_impl.py`: Replace _T TypeVar with ParamSpec for _onnx_op decorator - `torch/onnx/_internal/exporter/_torchlib/_torchlib_registry.py`: Replace _T TypeVar with ParamSpec ## Motivation The previous implementation used TypeVar bound to Callable which erased parameter type information to Any. ParamSpec preserves the exact parameter types and return types, providing better type safety and IDE support. ## Testing - Verified all changes compile and import correctly - Created comprehensive test suite to validate ParamSpec functionality - No linting errors introduced - Maintains backward compatibility Fixes #142306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162332 Approved by: https://github.com/Skylion007	2025-09-07 18:06:03 +00:00
PyTorch MergeBot	7a83cf430e	Revert " [while_loop][autograd] support autograd_key of while_loop (#160483 )" This reverts commit 2b8a83901c58a0858ea9e4ce00055f48e6ed164c. Reverted https://github.com/pytorch/pytorch/pull/160483 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but some trunk tests are failing either from this PR or the previous one in the stack ([comment](https://github.com/pytorch/pytorch/pull/160483#issuecomment-3263597325))	2025-09-07 08:50:49 +00:00
PyTorch MergeBot	ada43ed39c	Revert "[inductor] pdl inductor option (disabled by default) (#160928 )" This reverts commit 9458d1ac3bd70c2af316a8ba95d2c6c9c1199c9c. Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))	2025-09-07 07:37:37 +00:00
Huy Do	93fb23d6fa	Build vLLM nightly wheels (#162000 ) This uses the same approach as building triton wheel where we publish a nightly wheel for vLLM whenever its pinned commit is updated. The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes on the vLLM Dockerfile used by lumen_cli 1. `pytorch/manylinux2_28-builder` is RedHat instead of Debian-based, so no apt-get 2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build 3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist` 4. In vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89] 5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled 6. Bump xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has been landed on vLLM We need to prepare 3 wheels for vLLM, xformers, and flashinfer-python. And I rename them in the same convention as PyTorch nightlies `MAJOR.MINOR.PATCH.devYYYYMMDD` so that vLLM nightlies will work with torch nightlies on the same date. ### Usage * Install latest nightlies ``` pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \ --index-url https://download.pytorch.org/whl/nightly/cu129 ``` * Install a specific version ``` pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \ vllm==1.0.0.dev20250903 \ xformers==0.0.33.dev20250903 \ flashinfer_python==0.2.14.dev20250903 \ --index-url https://download.pytorch.org/whl/nightly/cu129 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000 Approved by: https://github.com/atalman	2025-09-07 06:09:17 +00:00
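The `MAJOR.MINOR.PATCH.devYYYYMMDD` naming convention mentioned in the commit above can be sketched as follows. This is a minimal illustration, not the script used by the build workflow; the helper names `nightly_version`, `nightly_date`, and `same_night` are assumptions made for this example.

```python
import re
from datetime import date

# Matches the MAJOR.MINOR.PATCH.devYYYYMMDD shape described in the commit.
_NIGHTLY_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.dev(\d{8})$")


def nightly_version(base: str, day: date) -> str:
    """Hypothetical helper: append a .devYYYYMMDD suffix to a base version."""
    return f"{base}.dev{day:%Y%m%d}"


def nightly_date(version: str) -> date:
    """Hypothetical helper: recover the build date from a nightly version."""
    match = _NIGHTLY_RE.match(version)
    if match is None:
        raise ValueError(f"not a nightly version: {version!r}")
    stamp = match.group(4)
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))


def same_night(*versions: str) -> bool:
    """True when every given nightly version carries the same date stamp."""
    return len({nightly_date(v) for v in versions}) == 1
```

Under this convention, wheels built on the same day share a date suffix, so a check like `same_night("2.9.0.dev20250903", "0.0.33.dev20250903")` is one way to confirm that a set of pins belongs to the same nightly.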
PyTorch MergeBot	104f2680e0	Revert "Add return-max-scores to flex-attention (#161667 )" This reverts commit 486b20b73cfcf32a773a4301b1b97f91c157ce76. Reverted https://github.com/pytorch/pytorch/pull/161667 on behalf of https://github.com/huydhn due to Sorry for reverting your change but reverting https://github.com/pytorch/pytorch/pull/161730 does not seem to fix all trunk failures ([comment](https://github.com/pytorch/pytorch/pull/161667#issuecomment-3263512642))	2025-09-07 06:00:55 +00:00
PyTorch MergeBot	eac3d6f04c	Revert "[inductor] fuse for scalar shared data (#162311 )" This reverts commit 2a45837e98c63cae9d1a2e2133a727b829e549d5. Reverted https://github.com/pytorch/pytorch/pull/162311 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is breaking lint ([comment](https://github.com/pytorch/pytorch/pull/162311#issuecomment-3263511162))	2025-09-07 05:57:43 +00:00
PyTorch UpdateBot	fea20775ad	[vllm hash update] update the pinned vllm hash (#162314 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162314 Approved by: https://github.com/pytorchbot	2025-09-07 04:29:23 +00:00
Shunting Zhang	2a45837e98	[inductor] fuse for scalar shared data (#162311 ) LOAF previously could skip these fusion opportunities and cause some tests to fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311 Approved by: https://github.com/jansel ghstack dependencies: #162028, #162221, #162303	2025-09-07 01:48:45 +00:00
Yiming Zhou	b919560c4a	[nativert] AOTI lowering and packaging as NativeRT delegate (#162285 ) Summary: A demo for creating AOTI delegate for NativeRT in OSS. - It supports full graph lowering only. - It leverages `executorch_call_delegate` HOP but doesn't rely on `executorch`. - The delegate graph is obtained by tracing a `LoweredBackendModule` whose forward function calls `executorch_call_delegate`. - The main difference between `executorch_call_delegate` and `aoti_call_delegate` is that the delegate graph from `executorch_call_delegate` doesn't have weights lifted as inputs. - original_ep and delegate_ep are treated as flat EP dictionary and there is no nested structure. - The naming contract is enforced by `model_name` and `backend_id` Test Plan: CI Rollback Plan: Differential Revision: D81641157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162285 Approved by: https://github.com/dolpm	2025-09-07 01:29:54 +00:00
Animesh Jain	e3068cdb44	[dynamo] Use relaxed CLOSURE_MATCH guard then ID_MATCH (#162247 ) I am unable to write a test that would fail here. The reason is that when we do _dynamo.disable(fn) in the compiled frame, the id of disabled function changes but currently we guard on the original function - `fn` whose id is not changing. This PR still guards on the `fn.__code__` just to be more precise. Thanks to @thenumberouscode for pointing this out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162247 Approved by: https://github.com/StrongerXi, https://github.com/jansel	2025-09-07 01:25:52 +00:00
Yiming Zhou	5211f1f908	[export] Move example inputs in move_to_device_pass (#162301 ) Summary: If I have an EP that's exported on CPU and want to AOTI compile it for CUDA, I need to use `move_to_device_pass`. But `torch._inductor.aoti_compile_and_package()` directly uses the `example_inputs` attached to the EP, so we should move the example inputs as well if applicable. Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_move_device_example_inputs Rollback Plan: Differential Revision: D81812366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162301 Approved by: https://github.com/angelayi	2025-09-06 23:54:54 +00:00
Yidi Wu	2b8a83901c	[while_loop][autograd] support autograd_key of while_loop (#160483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483 Approved by: https://github.com/zou3519 ghstack dependencies: #160548, #160467	2025-09-06 21:26:33 +00:00
Yidi Wu	48e3be3ab6	[while_loop][autograd] add hop while_loop_stack_output (#160467 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160467 Approved by: https://github.com/zou3519 ghstack dependencies: #160548	2025-09-06 21:26:33 +00:00
mansiag05	5927a70934	NLLLoss: validate target is 0D when input is 1D (#161412 ) Add a shape check in nll_loss_forward to error out when both input and target are 1D. Added a unit test to cover the incompatible 1D/1D case. Fixes #157420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161412 Approved by: https://github.com/ngimel	2025-09-06 20:58:42 +00:00
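The shape rule named in the commit above can be sketched on plain shape tuples. The real check lives in `nll_loss_forward` and is not reproduced here; `check_nll_loss_shapes` is a hypothetical helper, and the batched branch is included only as the familiar `(N, C)` input with `(N,)` target case.

```python
def check_nll_loss_shapes(input_shape: tuple, target_shape: tuple) -> None:
    """Hypothetical sketch of the shape rule; operates on shape tuples only."""
    if len(input_shape) == 1:
        # Unbatched case: input is (C,), so the target must be a 0D class index.
        if len(target_shape) != 0:
            raise ValueError(
                f"expected a 0D target for 1D input, got target shape {target_shape}"
            )
    elif len(input_shape) == 2:
        # Batched case: input is (N, C) and the target is (N,).
        if len(target_shape) != 1 or target_shape[0] != input_shape[0]:
            raise ValueError(
                f"expected target shape ({input_shape[0]},), got {target_shape}"
            )
```

The 1D/1D combination is the case the commit says now raises an error instead of being accepted silently.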
Shunting Zhang	1a588ace46	[inductor] rename deps during refreshing (#162303 ) Skipping renaming causes wrong dependencies when mutations are involved. Test: CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap Both all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162303 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #162028, #162221	2025-09-06 20:38:28 +00:00
Shunting Zhang	541aa23de5	[inductor] fix TemplateBuffer.extract_read_writes (#162221 ) Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162221 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162028	2025-09-06 20:38:28 +00:00
Tugsbayasgalan Manlaibaatar	047603d35b	New export implementation with flat inp/out (#162167 ) This is my first attempt at building a new export API. The main thing it addresses is correctly getting input and output relations. Subsequent diffs will add functionality for dynamic shapes, nn_module_stack etc. Differential Revision: [D81793205](https://our.internmc.facebook.com/intern/diff/D81793205) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162167 Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri	2025-09-06 20:03:52 +00:00
Deng, Daisy	ae0edc133e	[3/N] Enable 6 fsdp test on Intel GPU (#161601 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR is based on PR https://github.com/pytorch/pytorch/pull/158533 and https://github.com/pytorch/pytorch/pull/159473 and covers some test files under test/distributed/fsdp. We enable Intel GPU with the following methods while trying to keep the original code style in this PR: 1. add allow_xpu=True in instantiate_device_type_tests() if needed. 2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 3. enable XPU for some test paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/161601 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-06 16:47:13 +00:00
Daniel Vega-Myhre	b6d0a9ea90	MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209 ) ## Summary - We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816 - This is needed for backward pass of mxfp8 MoE training with grouped gemms - Changes: - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm` - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs - Bump FBGEMM third party submodule to include: - https://github.com/pytorch/FBGEMM/pull/4816 - https://github.com/pytorch/FBGEMM/pull/4820 - https://github.com/pytorch/FBGEMM/pull/4821 - https://github.com/pytorch/FBGEMM/pull/4823 #### How fbgemm dependency was bumped Documenting this since I haven't found it documented elsewhere: - `cd ~/pytorch/third_party/fbgemm` - `git fetch` - `git checkout <hash>` - `cd ~/pytorch` - `git add third_party/fbgemm` ## Test plan #### Test build ``` USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e . ... Successfully installed torch-2.9.0a0+gitf5070f3 ``` [full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581) #### Unit tests ``` pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_ ... test/test_matmul_cuda.py ......... [100%] ============================================================== 9 passed, 1668 deselected in 5.34s =============================================================== ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209 Approved by: https://github.com/ngimel	2025-09-06 15:25:30 +00:00
eqy	5985e28912	[CUDA 13][cuDNN][Windows] Roll back cuDNN upgrade from 9.13 to 9.12 on Windows (#162322 ) Forward fix for #162268 CC @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/162322 Approved by: https://github.com/atalman, https://github.com/nWEIdia	2025-09-06 13:32:07 +00:00
Blaine Burton Rister	9aedb3cd87	[AOTI-FX] Support registering custom FX backends (#162317 ) # Feature Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen. # Test plan Added a CI test registering a custom FX backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317 Approved by: https://github.com/jansel	2025-09-06 07:32:03 +00:00
PyTorch MergeBot	0ff8eabf13	Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670 )" This reverts commit 146371483318e17929daefd37c8e459d9d6d47bb. Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379267 and https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379271 ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3261241229))	2025-09-06 06:18:57 +00:00
Jeffro	28f4ab0737	Add -Wno-ctad-maybe-unsupported compiler flag (#162223 ) When running bazel build, we (Google) run into the following error. The `-Wctad-maybe-unsupported` warning would be raised to an error and break the build in certain cases. So, we propose to suppress the warning to make the build with bazel more smooth. This is the error message we got: ``` c10/util/IntrusiveList.h:166:12: error: 'std::reverse_iterator' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported] 166 | return std::reverse_iterator{end()}; | ^ c10/test/util/IntrusiveList_test.cpp:24:18: note: in instantiation of member function 'c10::IntrusiveList<(anonymous namespace)::ListItem>::rbegin' requested here 24 | auto it = c1.rbegin(); | ^ c10/test/util/IntrusiveList_test.cpp:43:5: note: in instantiation of function template specialization '(anonymous namespace)::check_containers_equal<(anonymous namespace)::ListItem>' requested here 43 | check_containers_equal(l, v); | ^ libcxx/include/__iterator/reverse_iterator.h:51:7: note: add a deduction guide to suppress this warning 51 | class reverse_iterator | ^ 1 error generated. ``` @haifeng-jin Pull Request resolved: https://github.com/pytorch/pytorch/pull/162223 Approved by: https://github.com/ezyang	2025-09-06 06:11:37 +00:00
Codeboi007	c98ddaca6d	Fixed comment to match logic in distributed_c10d.py (#162158 ) The comment was inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162158 Approved by: https://github.com/wconstab	2025-09-06 05:37:49 +00:00
morrison-turnansky	bc505977fb	torch.zeros bound checks for symint (#161976 ) Fixes #161490 I added a bounds check for negative symints to create a better error message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161976 Approved by: https://github.com/ezyang	2025-09-06 05:37:42 +00:00
orangeH25	aac1a50a19	Add api info for torch._C._nn.pyi (#162148 ) Fix part of #148404 APIs involved are as follows: - cross_entropy_loss - hardsigmoid_ - hardswish - hardswish_ - huber_loss Pull Request resolved: https://github.com/pytorch/pytorch/pull/162148 Approved by: https://github.com/FFFrog, https://github.com/ezyang	2025-09-06 05:21:40 +00:00
Isuru Fernando	20b47acef8	[fx] fix qualified name for methods of torch.Tensor (#162224 ) Fixes #160077, #154721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162224 Approved by: https://github.com/ezyang	2025-09-06 05:16:19 +00:00
Mario Šaško	da4db4b33d	Fix `DeviceMesh._flatten` docstring example (#162277 ) Fix the `DeviceMesh._flatten` docstring example of use. Alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`. (I verified the fix using the `gloo` backend) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162277 Approved by: https://github.com/ezyang	2025-09-06 05:00:00 +00:00
PyTorch MergeBot	a3e5466002	Revert "Resize to 0 if not going to be used (#161730 )" This reverts commit 081cab045472ce045634548cc6c14a4870641e23. Reverted https://github.com/pytorch/pytorch/pull/161730 on behalf of https://github.com/davidberard98 due to functorch/test_aotdispatch.py::TestAOTModuleSimplified::test_flex_attn_noncontiguous_tangents [GH job link](https://github.com/pytorch/pytorch/actions/runs/17506617662/job/49731934012) [HUD commit link](`081cab0454`) ([comment](https://github.com/pytorch/pytorch/pull/161730#issuecomment-3260492575))	2025-09-06 04:17:08 +00:00
Boyuan Feng	c0983e6cc0	[Graph Partition] interface for custom cg wrapper (#162207 ) This PR adds an interface to allow users to specify custom cudagraph wrapper. User example: [vllm](https://github.com/vllm-project/vllm/pull/24281) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162207 Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg	2025-09-06 03:13:01 +00:00
Edward Z. Yang	b2b4add0e7	Docs on export joint with descriptors (#159006 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159006 Approved by: https://github.com/SherlockNoMad	2025-09-06 03:02:58 +00:00
Gabriel Ferns	20629b1619	Add contiguous subgraph transformation threshold (#162192 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162192 Approved by: https://github.com/coconutruben	2025-09-06 02:48:00 +00:00
Raman Kumar	c3ceca2995	codebase structure documentation to include torchgen (#162261 ) 📚 The doc update adding description about torchgen folder in code structure guide Pull Request resolved: https://github.com/pytorch/pytorch/pull/162261 Approved by: https://github.com/ezyang	2025-09-06 02:10:57 +00:00
Eddie Yan	145a3a7bda	[CUDA 13][cuDNN] Bump CUDA 13 to cuDNN 9.13.0 (#162268 ) Fixes some `d_qk` != `d_v` cases on Hopper that are broken by cuDNN 9.11-9.12 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162268 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2025-09-06 01:59:03 +00:00
ruisizhang123	291cd11f2d	[inductor] estimate peak memory in codegen only when buffer reuse (#162300 ) As titled, this PR ensures peak memory is estimated only when buffer reuse is enabled. Without this config, some nodes' successor nodes are eliminated from memory estimation after inductor bucketing, which can cause errors. The original codegen peak memory estimation code is from this PR: https://github.com/pytorch/pytorch/pull/159530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162300 Approved by: https://github.com/eellison, https://github.com/v0i0	2025-09-06 01:30:38 +00:00
Yang Wang	7f4ff79210	remove deprecated vllm test (#162306 ) Fixes https://github.com/pytorch/pytorch/issues/162274 the test is removed from vllm side Pull Request resolved: https://github.com/pytorch/pytorch/pull/162306 Approved by: https://github.com/malfet	2025-09-06 01:27:13 +00:00
Will Feng	0f45aaf441	Disable autocast when running joint graph passes (#162304 ) Fixes #159469. See https://github.com/pytorch/pytorch/issues/159469#issuecomment-3221474027 for root-cause analysis. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162304 Approved by: https://github.com/bdhirsh, https://github.com/zou3519, https://github.com/eellison	2025-09-06 00:57:58 +00:00
dolpm	4f72d932fe	re-land triton runtime implementation" (#162217 ) Summary: original pr - https://github.com/pytorch/pytorch/pull/161798 Test Plan: ci Rollback Plan: Differential Revision: D81724234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162217 Approved by: https://github.com/SherlockNoMad	2025-09-06 00:52:29 +00:00
Rob Timpe	1463714833	[dynamo] Graph break on on user-defined class in compiled region (#161670 ) Currently, user-defined classes inside of a compiled frame will cause the whole frame to be skipped by dynamo. This change defers the Unsupported exception until the __build_class__ builtin is actually called, which allows a graph break to be inserted. Fixes #161562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670 Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas	2025-09-06 00:04:57 +00:00
drisspg	081cab0454	Resize to 0 if not going to be used (#161730 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #161730 * #161667 ```Py with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32) buf1 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32) buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32) # Topologically Sorted Source Nodes: [flex_attention], Original ATen: [] stream0 = get_raw_stream(0) triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0) del arg0_1 del arg1_1 del arg2_1 del arg3_1 del arg4_1 del arg5_1 del arg6_1 del buf0 del buf1 return (buf2, ) ``` Vs ```Py with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32) buf1 = empty_strided_cuda((0, ), (1, ), torch.float32) buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32) # Topologically Sorted Source Nodes: [flex_attention], Original ATen: [] stream0 = get_raw_stream(0) triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0) del arg0_1 del arg1_1 del arg2_1 del arg3_1 del arg4_1 del arg5_1 del arg6_1 del buf0 del buf1 return (buf2, ) ``` <img width="428" height="145" alt="Screenshot 2025-08-28 at 12 37 11 PM" src="https://github.com/user-attachments/assets/240a7bca-97e1-40c4-bf93-f075fdc1a40d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161730 Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng ghstack dependencies: #161667	2025-09-05 23:21:46 +00:00
drisspg	486b20b73c	Add return-max-scores to flex-attention (#161667 ) # Summary ### Update API ```Py class AuxRequest(NamedTuple): """Request which auxiliary outputs to compute from flex_attention. Each field is a boolean indicating whether that auxiliary output should be computed. """ lse: bool = False max_scores: bool = False class AuxOutput(NamedTuple): """Auxiliary outputs from flex_attention operation. Fields will be None if not requested, or contain the tensor if requested. """ lse: Optional[Tensor] = None max_scores: Optional[Tensor] = None out_only = flex_attention(query, key, value, score_mod) out_max, aux_max = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(max_scores=True), ) out_both, aux_both = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True), ) ``` Returns the max post mod scores from flex attention. Not being able to break BC is kind of annoying here since we end up with a combinatorial problem where if we need to add any more return vals we need new kwargs that gate if they get returned by the function and need to support the 2**N possible return groups. Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added kwarg-only for now. Maybe we make an `ExtraReturns` type kwarg that can grow so we don't need to keep adding new top level args. We could also return a Struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just 1 `ExtraReturns`-like struct with the tensors. ### Req Grad I currently don't return a max_scores that supports backpropagating grads. 
I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward or find the max_score and apply the gradient only to the first occurrence if there are multiple equal scores (need to check if that's how it is defined for the vanilla max op in torch). For now no grad; we can revisit if needed. ## Perf I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return None or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path. ```Shell 🔝 Top 5 TFlops Deltas (by absolute %): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ 🔺 Top 5 Positive TFlops Deltas (highest +%): shape: (5, 7) 
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘ 🔻 Top 5 Negative TFlops Deltas (lowest -%): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ 
sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │ │ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng	2025-09-05 23:21:46 +00:00
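The "Req Grad" note in the commit above describes max as one-hot on its inputs. A minimal pure-Python sketch of that idea follows, assuming the first-occurrence tie-breaking the message floats as one option; the message itself leaves the tie-breaking rule unverified, so this is an illustration rather than a statement of how torch defines it. The names `max_forward` and `max_backward` are hypothetical.

```python
def max_forward(scores: list[float]) -> tuple[float, int]:
    """Return the max score and the index of its first occurrence."""
    if not scores:
        raise ValueError("scores must be non-empty")
    best = 0
    for i, s in enumerate(scores):
        # Strict comparison keeps the first occurrence when scores tie.
        if s > scores[best]:
            best = i
    return scores[best], best


def max_backward(grad_out: float, max_location: int, n: int) -> list[float]:
    """Route the incoming gradient to the saved location; zeros elsewhere."""
    grad = [0.0] * n
    grad[max_location] = grad_out
    return grad
```

Saving `max_location` in the forward pass is what makes the backward a simple scatter; without it, the backward would have to recompute which input attained the max.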
Xuan Zhang	4d4abec80f	allow user to pass in custom partitioner function (#157580 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157580 Approved by: https://github.com/bdhirsh	2025-09-05 22:49:39 +00:00
Nikita Shulga	9c03d6be87	[CD][BE] Delete Python-3.9 case (#162265 ) And raise error when building for an unsupported version Pull Request resolved: https://github.com/pytorch/pytorch/pull/162265 Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi ghstack dependencies: #162297	2025-09-05 22:46:36 +00:00
Nikita Shulga	8d50355d97	[CD][EZ] Update libtorch python version to 3.10 (#162297 ) Not sure why it was at 3.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162297 Approved by: https://github.com/clee2000, https://github.com/atalman	2025-09-05 22:46:36 +00:00
dolpm	e0a62b266c	[aot-precompile] default-filter global guards (#162090 ) if the user doesn't provide their own guard filter fn, we should by default filter global guards. pytest test/dynamo/test_aot_compile.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/162090 Approved by: https://github.com/zhxchen17	2025-09-05 22:44:55 +00:00
Saurabh Mishra	01ab325cc2	[DCP][Quantization] Fix the issue when scale vector is in a different SafeTensors file (#162214 ) Summary: The current dequantization implementation assumes that the weight and scale tensors are in the same SafeTensors file. This diff fixes the issue to support the case when these could be in different files. Test Plan: buck test fbcode//caffe2/test/distributed/checkpoint\:test_quantized_hf_storage Buck UI: https://www.internalfb.com/buck2/532bf151-bb40-41fd-b080-ff898675afe2 Test UI: https://www.internalfb.com/intern/testinfra/testrun/15199648851011082 Rollback Plan: Differential Revision: D81718598 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162214 Approved by: https://github.com/wwwjn	2025-09-05 22:43:58 +00:00
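The situation described in the commit above, a weight and its scale stored in different SafeTensors files, can be sketched with a plain name-to-file index. Everything here is an assumption for illustration: the function `plan_reads`, the `_scale_inv` suffix, and the tensor and file names are hypothetical and are not taken from the DCP implementation.

```python
def plan_reads(
    weight_map: dict[str, str], scale_suffix: str = "_scale_inv"
) -> dict[str, tuple[str, str]]:
    """Hypothetical sketch: map each quantized weight to (weight_file, scale_file).

    weight_map is a tensor-name -> file-name index. The scale is looked up
    by name across the whole index instead of assuming it sits in the same
    file as the weight.
    """
    plan = {}
    for name, weight_file in weight_map.items():
        if name.endswith(scale_suffix):
            continue  # scale entries are consumed via their weight
        scale_file = weight_map.get(name + scale_suffix)
        if scale_file is None:
            continue  # no scale entry: treat the tensor as not quantized
        plan[name] = (weight_file, scale_file)
    return plan
```

A reader built on such a plan would open `scale_file` separately whenever it differs from `weight_file`, which is the case the commit says was previously unsupported.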
Laith Sakka	79fcd5247a	symbolic cpp channels_last_contiguous (#160402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160402 Approved by: https://github.com/aorenste	2025-09-05 21:40:32 +00:00
rzou	70d36e047d	Making batching rule for F.embedding DTensor-aware (#162117 ) `vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's batching rule generates a new tensor via at::arange, at::arange generates a regular tensor, and DTensor rightfully errors on mixed DTensor-regular Tensor operations. This PR fixes the problem by activating DTensor implicit replication on just the at::arange and the subsequent add operation. In order to accomplish this I move the DTensor implicit replication flag to C++ (most batching rules are in C++). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117 Approved by: https://github.com/bdhirsh	2025-09-05 21:40:14 +00:00
Nikita Shulga	a00cdc1e41	[CD][BE] Get rid of SETUPTOOLS and PYYAML extra pins (#162266 ) As those weren't really pins to begin with, and requirements.txt already has those Pull Request resolved: https://github.com/pytorch/pytorch/pull/162266 Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi ghstack dependencies: #162263, #162264	2025-09-05 21:32:52 +00:00
Shunzhi Wen	c10195e723	[C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#156633 ) - Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it. - Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo. Fixes #156632 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633 Approved by: https://github.com/d4l3k	2025-09-05 21:24:36 +00:00
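The commit above mentions a shared function that checks whether a reduce operation supports complex data, without showing it. As a rough illustration of why such a check is needed, the sketch below reduces complex values by treating each one as a (real, imaginary) pair, which is only meaningful for operations that act component-wise. The set `COMPONENTWISE_OPS` and the function `allreduce_complex` are hypothetical, and the set's contents are an assumption, not the list used by ProcessGroupNCCL or ProcessGroupGloo.

```python
# Hypothetical: reduce ops assumed safe to apply to real and imaginary parts
# independently. The actual supported set lives in the shared C++ helper.
COMPONENTWISE_OPS = {"sum"}


def allreduce_complex(per_rank: list[list[complex]], op: str = "sum") -> list[complex]:
    """Simulate an all-reduce over ranks by reducing (real, imag) pairs."""
    if op not in COMPONENTWISE_OPS:
        # Ordering-based ops such as max have no natural meaning on complex values.
        raise ValueError(f"reduce op {op!r} is not supported for complex data")
    length = len(per_rank[0])
    result = []
    for i in range(length):
        real = sum(rank[i].real for rank in per_rank)
        imag = sum(rank[i].imag for rank in per_rank)
        result.append(complex(real, imag))
    return result
```

Each inner list stands for one rank's tensor; the result is what every rank would hold after the reduction.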
Boyuan Feng	771f369448	[Inductor] Improve RoPE (#161420 ) This PR fuses ROPE from 2 kernels into 1 kernel. Shape: ``` q: [B, Hq, S, D] k: [B, Hkv, S, D] ``` `Hq=32, Hkv=8, D=128` following Llama3 setting. <img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420 Approved by: https://github.com/shunting314	2025-09-05 20:55:20 +00:00
henrylhtsang	92a43025e0	[cutlass backend] Add FP8 tests for multiple linears (#160782 ) Adding a test that is closer to real use case. Thanks @mlazos for fixing a few issues so this test works for most cases. We still have to skip the AOTI and dynamic case due to accuracy issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782 Approved by: https://github.com/mlazos	2025-09-05 20:23:25 +00:00
Xuehai Pan	2fa0520a64	[BE][pytree] cleanup parameterized pytree tests (#160842 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160842 Approved by: https://github.com/Skylion007	2025-09-05 20:15:29 +00:00
Edward Z. Yang	01edcd4df8	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-05 20:15:11 +00:00
Edward Yang	de893e96c7	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-05 20:15:11 +00:00
Nikita Shulga	6087ef41e5	[BE] Cleanup stale comments/copy from `gemm` (#162001 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001 Approved by: https://github.com/drisspg ghstack dependencies: #161999	2025-09-05 19:59:51 +00:00
Nikita Shulga	a3c7f77e50	[EZ][CD] Update MacOS deployment platform to 11.0 (#162264 ) Fixes following warning ``` MACOSX_DEPLOYMENT_TARGET is set to a lower value (10.15) than the version on which the Python interpreter was compiled (11.0) ``` Update deployment platform in `README.MD` as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/162264 Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi ghstack dependencies: #162263	2025-09-05 19:58:04 +00:00
Justin Chu	3771380f83	[ONNX] Hide draft export under a flag (#162225 ) Use `TORCH_ONNX_ENABLE_DRAFT_EXPORT` to control whether draft_export should be used as a strategy in onnx export. Follow up of https://github.com/pytorch/pytorch/pull/161454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162225 Approved by: https://github.com/xadupre, https://github.com/titaiwangms	2025-09-05 19:54:50 +00:00
PyTorch MergeBot	adae7f66aa	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit c37103234afc832dcad307e9016230810957c9d5. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))	2025-09-05 18:58:47 +00:00
PyTorch MergeBot	70f865ac9b	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))	2025-09-05 18:58:47 +00:00
Scott Wolchok	88d94d17e8	Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further (#161590 ) This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec) Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161590 Approved by: https://github.com/albanD ghstack dependencies: #161466, #161586	2025-09-05 18:43:41 +00:00
Ruben Rodriguez Buchillon	c321111499	[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348 ) # why - every callsite just executes the generator on the spot - previous pr adds the ability to add an override before expensive generators are executed, so we don't need this generator anymore # what - rather than yielding the ChoiceCaller, just return the list of all valid ChoiceCallers # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348 Approved by: https://github.com/eellison ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346, #161347	2025-09-05 18:02:53 +00:00
Ruben Rodriguez Buchillon	9a8d454c46	[inductor] add kernel template choice (ktc) (#161347 ) # why - gather everything up to make choices, without running potentially expensive generators - enables overrides where we toss the entire list of configs from inductor, without having to enumerate it (expensive) # what - add a holding class that just gets all the components necessary to generate a ChoiceCaller - use that class to generate ChoiceCallers - this does not (yet) add the override function, but just prepares the scene ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347 Approved by: https://github.com/eellison ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346	2025-09-05 18:02:53 +00:00
Ruben Rodriguez Buchillon	e02e9edb55	[inductor] V.choice.get_mm_configs takes a stack of templates (#161346 ) # why - enables us to just gather relevant templates and get all choices at once - that in turn allows us to make op-wide override decisions # what - V.choice.get_mm_configs takes a stack of templates - all callsites just provide a stack of size 1 right now but do not merge everything yet (other features pending) # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520583](https://our.internmc.facebook.com/intern/diff/D81520583) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161346 Approved by: https://github.com/eellison ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345	2025-09-05 18:02:46 +00:00
Ruben Rodriguez Buchillon	d63ad53a99	[inductor][ez] return choicecallers directly (#161345 ) # why - remove repeat patterns - we have everything to make the choicecallers - templates - input_nodes - layouts - all the kwargs # what - yield a choicecaller directly from V.choices.get_mm_configs # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345 Approved by: https://github.com/jansel ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344	2025-09-05 18:02:38 +00:00
Ruben Rodriguez Buchillon	031d79cb51	[inductor] move max-autotune logic inside V.choices.get_mm_configs (#161344 ) # why - heuristics providers now decide whether to add choices (and which ones) in the max-autotune case - enables an eventual override point to gracefully fall back to the standard behavior # what - max-autotune is determined inside V.choices.get_mm_configs; because it's mm only right now, we can just do `config.max_autotune or config.max_autotune_gemm`; a TODO indicates that this can change in the future when this expands to more templates # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344 Approved by: https://github.com/jansel ghstack dependencies: #162075, #161340, #161341, #161342, #161343	2025-09-05 18:02:30 +00:00
Ruben Rodriguez Buchillon	a301dc3b60	[inductor][ez] pass template rather than template.uid (#161343 ) # why - simpler interface - enables future of extracting more things out of the template e.g. a hash # what V.choices.get_mm_configs now takes the whole template rather than just the template.uid # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520576](https://our.internmc.facebook.com/intern/diff/D81520576) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161343 Approved by: https://github.com/jansel ghstack dependencies: #162075, #161340, #161341, #161342	2025-09-05 18:02:22 +00:00
Ruben Rodriguez Buchillon	af590cb729	[inductor][aten] treat like a template in GEMMs (#161342 ) # why - central point to analyze and override all generated choices # what - add a pseudo heuristic for aten that just yields a single, empty kwargs - add a pseudo heuristic with the bias_addmm logic for it - add an addmm specific heuristic that yields a single choice, but also expands it with alpha and beta kwargs - replace all the aten.bind calls with V.choices.get_mm_configs using the now matching API for aten # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342 Approved by: https://github.com/jansel ghstack dependencies: #162075, #161340, #161341	2025-09-05 18:02:10 +00:00
Ruben Rodriguez Buchillon	4902c76c65	[inductor][ez] add template/externchoice uid (#161341 ) # why - to have a central registry of templates/externkernelchoice to match them to heuristics etc, they need unique names - mm is both the triton template name and the aten_mm name # what - add a uid() to KernelTemplate/ExternKernelChoice that returns name - override in ExternKernel to prepend "aten::" - override in TritonTemplate to prepend "triton::" This id is just used to find template heuristics, so it has no other impact # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162075, #161340	2025-09-05 18:01:58 +00:00
Ruben Rodriguez Buchillon	9602590b15	[inductor] move scaled_mm input nodes logic (#161340 ) # why - a step towards a unified interface for all choices, where any adjustment to nodes (e.g. unsqueezing) happens as part of choice specific preprocessing, behind a common point # what - move the unsqueeze logic for triton nodes for scaled_mm inside the new hookup for adjusting the kernel inputs for template heuristics # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale" ``` Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162075	2025-09-05 18:01:44 +00:00
Ruben Rodriguez Buchillon	2ef665ae19	[inductor][contigous mm] mild refactor (#162075 ) # why - use the new heuristics logic better to handle kwargs # what - move all checks into the heuristics to yield a single choice, or no choices if the decomposition should not be used - fix `hip` device type, which should be `cuda` - let heuristics handle the kwarg passing # testing in ci Differential Revision: [D81706776](https://our.internmc.facebook.com/intern/diff/D81706776) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162075 Approved by: https://github.com/exclamaforte, https://github.com/jansel	2025-09-05 18:01:07 +00:00
Mikayla Gawarecki	b18bb6796f	Add const to stable amax (#162082 ) Fixes https://github.com/pytorch/pytorch/issues/161826 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162082 Approved by: https://github.com/soulitzer	2025-09-05 17:37:49 +00:00
PyTorch MergeBot	d711f27845	Revert "[ROCm] [CK] Composable Kernel integration for inductor backend (#158747 )" This reverts commit 019fed39aa6b2dd8c69347378d53423e5efae8d4. Reverted https://github.com/pytorch/pytorch/pull/158747 on behalf of https://github.com/jithunnair-amd due to Broke linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test: `019fed39aa/1` ... PR didn't have this job run successfully due to CI outage ([comment](https://github.com/pytorch/pytorch/pull/158747#issuecomment-3259212343))	2025-09-05 17:27:45 +00:00
Nikita Shulga	261a84a176	[CD][BE] Remove unnecessary checks for XCode version (#162263 ) None of them have worked for a while, PyTorch for Mac is built with XCode-15.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162263 Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi	2025-09-05 17:02:36 +00:00
xinan.lin	98374612fc	[Intel GPU] Update Intel triton commit pin to Triton 3.5.x (#161777 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161777 Approved by: https://github.com/EikanWang	2025-09-05 16:55:47 +00:00
Eddie Yan	c2a3024617	[cuBLASLt][FP8] `cuBLASLt` appears to support float8 rowwise-scaling on H100 (#161305 ) Following #157905 I think the macro around ``` TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt"); ``` was never updated and this would cause `float8` tests to fail. Also it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so removing that check here as well... CC @lw Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-05 16:55:09 +00:00
Xingyuan Li	b2c7b9ad2d	[Intel GPU][FlexAttention] Enable TMA path on Intel GPU (#162138 ) The existing `can_use_tma` has some conditions that are unnecessary for Intel GPUs. We have removed these useless conditions on the Intel GPU path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162138 Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf	2025-09-05 16:54:51 +00:00
PyTorch MergeBot	f3cebec39e	Revert "Rename propagate_tensor_meta to make private again (#161744 )" This reverts commit 734ce8eba9c69381f187359bf0fef1d71d84cd20. Reverted https://github.com/pytorch/pytorch/pull/161744 on behalf of https://github.com/jeanschmidt due to seems to break internal tests, see D81657000 for more details ([comment](https://github.com/pytorch/pytorch/pull/161744#issuecomment-3258934519))	2025-09-05 16:20:29 +00:00
Saurabh Mishra	06da7c0730	[DCP][Quantization] Fix for FP8 multiplication during dequantization (#162202 ) Summary: Weight vector needs to be upcasted since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13 We will use FP32 for the scale vector multiplication and convert to the target dtype. Upcasting helps with the following: 1. Full CPU support: `float32` has complete CPU kernel implementations for all operations 2. Numerical stability: `float32` provides more precision during intermediate calculations 3. Compatibility: Works across all devices (CPU/GPU) and PyTorch versions Test Plan: UTs Rollback Plan: Differential Revision: D81711093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202 Approved by: https://github.com/wwwjn	2025-09-05 16:06:21 +00:00
Edward Yang	2dd529df00	A basic CLAUDE.md based on bad things I see claude code doing (#162163 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162163 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-09-05 14:52:36 +00:00
Shunting Zhang	a714437093	[ez][inductor] add a few outer dimension reduction cases for LOAF (#162028 ) For the not able to fuse issue reported here: https://github.com/pytorch/pytorch/issues/93718 , LOAF can fuse the outer dimension softmax into a single kernel and brings 1.87x speedup for the example shape mentioned in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162028 Approved by: https://github.com/jansel, https://github.com/eellison	2025-09-05 09:30:13 +00:00
atalman	bffc7dd1f3	[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916 ) Related to https://github.com/pytorch/pytorch/issues/159779 Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956 Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007 Co-authored-by: Ting Lu <tingl@nvidia.com>	2025-09-05 07:47:54 +00:00
Zeng, Xiangdong	5c473e9f5e	[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. We enable Intel GPU with the following methods and try our best to keep the original code style: - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enable XPU for some test paths - skip some test cases which Intel GPU does not support Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-05 05:52:15 +00:00
Pian Pawakapan	5da573c42c	[PGO] handle PGO profile merges (#162097 ) Avoid merges from extra PGO key, if same source has different rank. Unlikely to happen (needs code hash match & source variable type to change), but being safe. Differential Revision: D81299840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097 Approved by: https://github.com/bobrenjc93	2025-09-05 04:58:15 +00:00
PyTorch UpdateBot	494878a11b	[audio hash update] update the pinned audio hash (#162114 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162114 Approved by: https://github.com/pytorchbot	2025-09-05 04:32:16 +00:00
PyTorch UpdateBot	3bbc2e3e4f	[vllm hash update] update the pinned vllm hash (#162226 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162226 Approved by: https://github.com/pytorchbot	2025-09-05 04:32:08 +00:00
Nick Riasanovsky	b67c410398	[BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409 ) Summary: When running coordinate descent tuning the logging is difficult to parse if the results are parallelized at all. This includes the kernel name in each step so post-processing can unify the results, even if run in parallel. Test Plan: NFC. Just a logging change. Rollback Plan: Differential Revision: D80942794 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409 Approved by: https://github.com/PaulZhang12	2025-09-05 02:53:13 +00:00
Colin L Reliability Rice	be5b03dde9	Allow for using a dedicated binary for the torch subproc pool. (#162093 ) Summary: The binary torch is running inside of can be larger than needed and in certain situations, this can cause a loss of memory. Test Plan: We've manually run tests via ``` TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0 make mc8-train-publish-cint-datafm-toy -C minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out ``` and overriding the binary used to be the built fbpkg in /packages. We've also kicked off manual runs at ``` fire-feid-20250903-1051-ae8c6827 ``` Which do show the binary running - https://fburl.com/scuba/procprint/e6lwv32m Rollback Plan: steps: - jk.update: jk: pytorch/compiler:subproc_worker_binary constant_bool: null consistent_pass_rate: null fractional_host_rollout: null sampling_rate: null - manual.note: content: '' Differential Revision: D81616624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093 Approved by: https://github.com/masnesral	2025-09-05 01:43:46 +00:00
Eddie Yan	73eb4511fb	[B200][NVFP4] Fix argument passing in `test_blockwise_mxfp8_nvfp4_mxfp4_numerics_` (#162185 ) to unblock https://github.com/pytorch/pytorch/pull/159494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-09-05 01:24:59 +00:00
Jeffro	29280864d9	Add new parameter for gen_pyi.py to make it more configureable. (#161772 ) This is a reposting of PR #128519. This change is important to how we maintain PyTorch at Google. From the previous PR: " This will make the script more flexible for the directory where it is executed. ... We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to the pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it failed to resolve since tools/autograd is a package from blaze perspective. Any file under a blaze package will need a proper blaze target to be accessed. " Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772 Approved by: https://github.com/albanD Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>	2025-09-05 00:48:15 +00:00
angelayi	5c67426d68	[dynamo] Add support for const prop on .item (#162204 ) Fixes some of the errors in https://fb.workplace.com/groups/1028545332188949/permalink/1303030824740397/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/162204 Approved by: https://github.com/williamwen42	2025-09-05 00:28:49 +00:00
Nikita Shulga	d2d4c8e9b2	[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999 Approved by: https://github.com/drisspg	2025-09-04 23:35:27 +00:00
Eddie Yan	c7e41071a0	[B200][MXFP8] Fix regex in `test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda` (#162180 ) to unblock https://github.com/pytorch/pytorch/pull/159494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia	2025-09-04 23:29:10 +00:00
xinan.lin	9499c8761c	[Inductor][Intel GPU] Register triton template heuristic for addmm tma. (#162132 ) Fixes #162048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162132 Approved by: https://github.com/jansel	2025-09-04 23:01:57 +00:00
Nan Zhang	3a207816cc	Forward fix for user defined triton kernel grid calc (#162162 ) Summary: This change fixes the test: inductor:fxir_backend - test_custom_triton_autotune_dynamic which was broken by https://github.com/pytorch/pytorch/pull/160997 Test Plan: inductor:fxir_backend - test_custom_triton_autotune_dynamic Rollback Plan: Differential Revision: D81679217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162162 Approved by: https://github.com/eellison, https://github.com/jansel	2025-09-04 22:51:23 +00:00
Yiming Zhou	09be1890d7	[export] Fix torch.export.load with storage offset (#162172 ) Summary: As titled Test Plan: CI Rollback Plan: Differential Revision: D81687701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162172 Approved by: https://github.com/angelayi	2025-09-04 22:50:33 +00:00
Pian Pawakapan	0d84ff3b78	[PGO] log add_extra_remote PGO to tlparse (#161751 ) Summary: log when additional PGO profile is merged in, from added read key Test Plan: test_pgo Rollback Plan: Differential Revision: D81284190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161751 Approved by: https://github.com/bobrenjc93	2025-09-04 22:47:03 +00:00
PyTorch MergeBot	1ec2c15914	Revert "Fix Arm64 OSS pytorch build with FBGEMM (#161527 )" This reverts commit dbec08729fb9848bebed6048c63831b87170d061. Reverted https://github.com/pytorch/pytorch/pull/161527 on behalf of https://github.com/malfet due to This breaks all Mac builds, see `b04e922712/1` ([comment](https://github.com/pytorch/pytorch/pull/161527#issuecomment-3256034443))	2025-09-04 22:29:38 +00:00
Shangdi Yu	b04e922712	Fix memory leak in AOTI when calling `aoti_torch_as_strided` (#162118 ) Summary: Fix memory leak in AOTI when calling `aoti_torch_as_strided` If you have something like `AtenTensorHandle buf_handle`; and you allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory because even when the program ends, there's still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed. Test Plan: ``` buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak ``` Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")` Differential Revision: D81640339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118 Approved by: https://github.com/angelayi	2025-09-04 22:17:06 +00:00
Brian Hirsh	0d71a9dd5b	fix incorrect interaction between DDPOptimizer and donated buffers (#160745 ) This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test. You can see more of a description of the problem in the code comments. But the TLDR is that: * When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graph becomes N AOT/inductor artifacts * We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per AOT graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i * This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during backward compilation. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have started compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donating and writing into incorrect input buffers The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-04 21:57:27 +00:00
Ke Wen	89d41d3f61	[SymmMem] Feed tensor.data_ptr instead of handle.buffer_ptr into kernels (#162193 ) After MemPool support, `get_buffer_ptrs` points to base address of allocation segment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162193 Approved by: https://github.com/ngimel	2025-09-04 21:26:05 +00:00
Ke Wen	9bdcee01f8	[SymmMem] Add root argument to broadcast op (#161090 ) It was missing earlier. Also added range check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090 Approved by: https://github.com/fegin	2025-09-04 21:09:54 +00:00
Prachi Gupta	b9ba612f7a	[ROCm] Enabling several UTs (#161715 ) All these UTs are working as is, just removing the skip - test_p2p_ipc - test_repros.py: working, added fp8 support - test_activation_checkpointing.py - test_content_store.py - test_cuda_multigpu.py - test_compute_comm_reordering.py - test_segment_reductions.py - test_dataloader.py - test_math_ops.py - test_loop_ordering.py - test_control_flow.py - distributed_test.py - test_mem_tracker.py - test_fsdp_optim_state.py - test_fully_shard_mixed_precision.py: skipped for < ROCm7.0 - test_aot_inductor_custom_ops.py - test_c10d_ops_nccl.py - test_eager_transforms.py - test_sparse_csr.py - test_inductor_collectives.py - test_fake_tensor.py - test_cupy_as_tensor.py - test_cuda.py: enable UTs that are working - test_matmul_cuda.py: enable UTs that are working Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-09-04 20:43:03 +00:00
PyTorch MergeBot	d5b38410b5	Revert "[SymmMem] Add root argument to broadcast op (#161090 )" This reverts commit 3c0ff1b569c45cfa6935ad8031a9d4cf1551aa3f. Reverted https://github.com/pytorch/pytorch/pull/161090 on behalf of https://github.com/jeanschmidt due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/161090#issuecomment-3255574093))	2025-09-04 20:42:31 +00:00
PyTorch MergeBot	48bedd753d	Revert "Fix usage of forwarding references (#161094 )" This reverts commit 1ebd70d0c0d562d3be9abdee2a21906584af7d99. Reverted https://github.com/pytorch/pytorch/pull/161094 on behalf of https://github.com/jeanschmidt due to checking if revert will fix https://github.com/pytorch/pytorch/actions/runs/17470601839/job/49621447581 ([comment](https://github.com/pytorch/pytorch/pull/161094#issuecomment-3255541480))	2025-09-04 20:35:41 +00:00
Wang, Eikan	a3d72b09ae	Apply Triton tensor descriptor for flex-decoding for performance (#161643 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161643 Approved by: https://github.com/drisspg	2025-09-04 20:10:41 +00:00
Edward Z. Yang	ef3be6726f	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-04 20:05:50 +00:00
PyTorch MergeBot	95ee0bfea9	Revert "[nativert] triton runtime implementation (#161798 )" This reverts commit 3dde5d7f9bf80dd6623a712bc429e9e4302464b5. Reverted https://github.com/pytorch/pytorch/pull/161798 on behalf of https://github.com/jeanschmidt due to introducing linting failures ([comment](https://github.com/pytorch/pytorch/pull/161798#issuecomment-3255412085))	2025-09-04 20:05:24 +00:00
Ben Niu	dbec08729f	Fix Arm64 OSS pytorch build with FBGEMM (#161527 ) Summary: X-link: https://github.com/pytorch/FBGEMM/pull/4775 Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error. Undefined symbols for architecture arm64: "fbgemm::FindMinMax(float const, float, float*, long long)", referenced from: at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o ld: symbol(s) not found for architecture arm64 This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc. Test Plan: With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled. Rollback Plan: Reviewed By: q10 Differential Revision: D81052327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527 Approved by: https://github.com/q10	2025-09-04 20:01:13 +00:00
PyTorch MergeBot	c3d54dea9f	Revert "[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999 )" This reverts commit 02c83f13348631d80aa23f57aaff6b7d1223bbdd. Reverted https://github.com/pytorch/pytorch/pull/161999 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))	2025-09-04 19:56:48 +00:00
PyTorch MergeBot	afa6e5604d	Revert "[BE] Cleanup stale comments/copy from `gemm` (#162001 )" This reverts commit b40d9432be44a6b5974ee62e7d19c3c61c5ece37. Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))	2025-09-04 19:56:48 +00:00
PyTorch MergeBot	9e5247f51d	Revert "[MPS] enable cat op for sparse (#162007 )" This reverts commit 2c03f0acc53ed13fe8ebfe809129f25996e009a0. Reverted https://github.com/pytorch/pytorch/pull/162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](https://github.com/pytorch/pytorch/pull/162007#issuecomment-3255357336))	2025-09-04 19:49:44 +00:00
Edward Yang	c37103234a	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-04 19:43:17 +00:00
dolpm	3dde5d7f9b	[nativert] triton runtime implementation (#161798 ) Summary: att Test Plan: ci Rollback Plan: Reviewed By: minjang Differential Revision: D80828148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161798 Approved by: https://github.com/minjang, https://github.com/SherlockNoMad	2025-09-04 19:00:15 +00:00
Aaron Gokaslan	1f51056bd6	[BE]: Update cpp-httplib submodule to 0.26.0 (#162181 ) Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162181 Approved by: https://github.com/jansel	2025-09-04 18:59:32 +00:00
Animesh Jain	6b1900c22f	[dynamo][hops] Remove const outputs from the speculated subgraph (#161355 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161355 Approved by: https://github.com/zou3519	2025-09-04 18:52:01 +00:00
mansiag05	9480cdc0b6	Modified the docs to add example for torch.is_floating_point and torc… (#161951 ) …h.is_complex. The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values. Fixes #161859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951 Approved by: https://github.com/zou3519	2025-09-04 18:50:19 +00:00
eqy	6f7608d603	[cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073 ) for 2.9 🙏 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162073 Approved by: https://github.com/drisspg	2025-09-04 18:46:28 +00:00
Albert W	d1a15abfdc	export: add explicit decomposition for aten.expand_copy and unit test (#161688 ) Fixes #161080 torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) registers aten.expand as the decomposition path for aten.expand_copy, which doesn’t accept the implicit argument. I have added an explicit decomposition for aten.expand_copy in torch/_decomp/decompositions.py that ignores the implicit argument, and a simple unit test to demonstrate the bug being fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688 Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou	2025-09-04 18:16:56 +00:00
Animesh Jain	33028597bf	[dynamo] Make the MRO walk more narrow (#162105 ) I don't have a failing test case, but I saw an extra guard somewhere. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162105 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel	2025-09-04 17:54:33 +00:00
vasiliy	9eadb37cdd	enable float32 and float16 in `torch._grouped_mm` fallback (#162059 ) Summary: Enables `torch.float32` and `torch.float16` options in `torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`, `mat_b`, and `out_dtype` are `torch.bfloat16`. Saving for future PRs: 1. enabling testing on more platforms 2. supporting out_dtype != mat_a.dtype 3. opinfo 4. better compile support Test Plan: ```bash // on A100 and H100 pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x // on H100 pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059 Approved by: https://github.com/ngimel, https://github.com/eqy ghstack dependencies: #161407, #161717	2025-09-04 17:48:52 +00:00
vasiliy	61fb632cfb	move `_grouped_mm` fallback to composite explicit autograd (#161717 ) Summary: Moves the `torch._grouped_mm` fallback from cuda-only code to a place where it can be used by multiple backends. Specifically: 1. make the fallback path and util functions reusable and move them to `ATen/native/GroupedMMUtils.h` 2. register a backend-agnostic kernel to composite explicit autograd key 3. refactor the grouped_mm tests to their own test case and enable CPU At the end of this PR, here is the support matrix: * CUDA SM90+: fast path with test coverage (no change) * CUDA SM80+: fallback with test coverage (no change) * CPU: fallback works, but without test coverage (new in this PR) * other SM versions and other backends: will probably already work, but let's leave this to future PRs * float32/float16: will probably already work, but let's leave this to future PRs Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717 Approved by: https://github.com/ngimel, https://github.com/drisspg ghstack dependencies: #161407	2025-09-04 17:48:52 +00:00
vasiliy	8a736fa1ea	create torch._grouped_mm fallback path with for loops / bmm (#161407 ) Summary: Creates a fallback path for `torch._grouped_mm`, using the naive for loop implementation (or bmm). For the sake of keeping the PR small, this PR only enables SM80+ (CUDA capability 8.0 and up), since I am testing this on an A100 machine. In future PRs, we can increase the coverage of the fallback to: 1. float32 and float16, which will extend the GPU coverage 2. cpu Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407 Approved by: https://github.com/drisspg, https://github.com/eqy	2025-09-04 17:48:44 +00:00
Ke Wen	8bb213b6d5	[SymmMem] Increase signal pad size for NVL72 (#162026 ) so that the signal calls do not step on each other's toes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162026 Approved by: https://github.com/ngimel	2025-09-04 17:41:38 +00:00
Ke Wen	869cbcc16e	[SymmMem] Add a helper API to distinguish intra- and inter- node (#161984 ) Added a helper API to tell if the world is entirely within a P2P domain or crosses network. This is mainly for nblocks tuning purpose. (In later PRs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161984 Approved by: https://github.com/ngimel ghstack dependencies: #161983	2025-09-04 17:37:59 +00:00
Frank Lin	0c0e056a9e	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the capturing graph). We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. 
Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it was allocated on this stream and whether each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. 
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. 
We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join (see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. We must then wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-04 17:21:26 +00:00
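The per-stream rule described in #158352 can be illustrated with a small reachability check. This is only a sketch in plain Python, not the allocator's C++ code: the graph is a hypothetical dict of node name to successor names, and the function names `is_ordered_before` and `reusable_on_stream` are invented for illustration. The sketch also assumes that a free marker which is itself a terminal counts as satisfied, since new work on that stream attaches after it.

```python
from collections import deque


def is_ordered_before(graph, src, dst):
    """Return True if src is dst or dst is reachable from src in the DAG."""
    if src == dst:
        return True
    seen = set()
    queue = deque(graph.get(src, ()))
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return False


def reusable_on_stream(graph, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal of stream S."""
    return all(
        is_ordered_before(graph, marker, terminal)
        for marker in free_markers
        for terminal in terminals
    )


# Two streams each place a free marker; "join" is ordered after both markers.
graph = {"free1": ["join"], "free2": ["join"]}

# Stream 1's terminal is still its own marker: free2 is unordered, so no reuse.
unsafe = reusable_on_stream(graph, ["free1", "free2"], ["free1"])
# After the join, stream 1's terminal follows both markers: reuse is allowed.
safe = reusable_on_stream(graph, ["free1", "free2"], ["join"])
```

The first query corresponds roughly to the unordered-frees situation (case 0) and the second to case 1 in the commit message.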
William Wen	f36f285953	[dynamo] change error_on_graph_break/fullgraph semantics (#161747 ) This PR implements the semantics change to `torch._dynamo.error_on_graph_break`: - ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~ - `error_on_graph_break` is a new internal `torch.compile` setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks. - `error_on_graph_break` does nothing when `fullgraph=True` - `error_on_graph_break` does NOT guarantee a single graph Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation: - `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled - `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time - `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747 Approved by: https://github.com/mlazos ghstack dependencies: #161739	2025-09-04 17:10:17 +00:00
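The three graph break modes listed in #161747 amount to a small priority table. The sketch below only models that priority in plain Python; `graph_break_mode` is a hypothetical helper, not a PyTorch API, and the returned strings are illustrative labels.

```python
def graph_break_mode(fullgraph: bool, error_on_graph_break: bool) -> str:
    """Model the priority between the two settings as described in the commit."""
    if fullgraph:
        # fullgraph wins: error_on_graph_break does nothing here.
        return "enforce one graph"
    if error_on_graph_break:
        return "error on graph break"
    return "resume tracing on graph break"


modes = {
    (fg, eogb): graph_break_mode(fg, eogb)
    for fg in (True, False)
    for eogb in (True, False)
}
```

Note that only the `fullgraph=True` rows promise a single graph; the "error on graph break" mode does not.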
Cui, Yifeng	ba7f546ccc	Update torch-xpu-ops commit pin (#162062 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](`83c5a5a551`), includes: - Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed - Fallback lu_factor kernel to CPU for single batch - Enable aten::linalg_inv and aten::linalg_inv_ex on XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062 Approved by: https://github.com/EikanWang	2025-09-04 17:05:33 +00:00
Lakshay Garg	43b7c86a2c	Add dependency-groups.dev to pyproject.toml (#161216 ) [PEP 735](https://peps.python.org/pep-0735) introduces the [dependency-groups] table for a number of use-cases one of which includes specifying development dependencies for projects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161216 Approved by: https://github.com/seemethere	2025-09-04 16:51:36 +00:00
iupaikov-amd	019fed39aa	[ROCm] [CK] Composable Kernel integration for inductor backend (#158747 ) This is part of our effort to integrate the Composable Kernel library into the Inductor backend. Currently we have a submodule, but would prefer to have commit pin control over the library as with Triton. We intentionally avoid putting all installation logic in CI scripts to allow locally built versions to have this functionality. The idea is to have CK as a pytorch dependency in the pytorch 2.9 release to allow people to use it with inductor and AOT inductor, and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files. This PR is a remake of https://github.com/pytorch/pytorch/pull/156192 due to a branch error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158747 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-04 16:51:06 +00:00
Oguz Ulgen	81aeefa657	Add torch.compile support for triton.constexpr_function (#162106 ) Fixes #161868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162106 Approved by: https://github.com/jansel, https://github.com/zou3519	2025-09-04 16:46:55 +00:00
Edward Yang	248355faf5	Don't require FakeStore to be passed into fake backend (#162164 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162164 Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab	2025-09-04 16:43:49 +00:00
Lakshay Garg	1ebd70d0c0	Fix usage of forwarding references (#161094 ) I found a number of places that seem to want forwarding references but the type signature does not reflect that Pull Request resolved: https://github.com/pytorch/pytorch/pull/161094 Approved by: https://github.com/malfet	2025-09-04 16:34:39 +00:00
Alexander Grund	cc5bdd1240	Keep default `CMAKE_PREFIX_PATH` in test_aot_inductor_package (#161907 ) `CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites it with a single path, so dependencies such as protobuf or Abseil are not found. Instead, prepend the path to the existing value. This fixes a test failure: > pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package > self.assertTrue(so_path.exists()) > AssertionError: False is not true Caused by: ``` /software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory /software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory collect2: error: ld returned 1 exit status ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161907 Approved by: https://github.com/Skylion007	2025-09-04 16:27:57 +00:00
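The difference between overwriting and prepending `CMAKE_PREFIX_PATH` (#161907) can be sketched as follows. This is not the code from the test; `prepend_prefix_path` is a hypothetical helper, and the sketch assumes the value is passed as an environment variable, where entries are separated by the platform's path separator.

```python
import os


def prepend_prefix_path(new_path, env=None):
    """Return a copy of env with new_path placed first in CMAKE_PREFIX_PATH."""
    env = dict(os.environ if env is None else env)
    existing = env.get("CMAKE_PREFIX_PATH", "")
    # Keep whatever was already configured instead of replacing it.
    parts = [p for p in (new_path, existing) if p]
    env["CMAKE_PREFIX_PATH"] = os.pathsep.join(parts)
    return env


# Hypothetical paths, for illustration only.
base = {"CMAKE_PREFIX_PATH": os.pathsep.join(["/opt/protobuf", "/opt/abseil"])}
updated = prepend_prefix_path("/opt/torch/share/cmake", base)
entries = updated["CMAKE_PREFIX_PATH"].split(os.pathsep)
```

The new path is searched first, while the previously configured locations remain visible to `find_package`.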
Yu, Guangye	3a20a20e70	Fix largeTensorTest malfunction on XPU (#161988 ) # Motivation https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161988 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-09-04 16:10:03 +00:00
PyTorch MergeBot	6b8b3ac440	Revert "[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044 )" This reverts commit cd529b686d54bbaa443f5b310140de48422d96c7. Reverted https://github.com/pytorch/pytorch/pull/162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](https://github.com/pytorch/pytorch/pull/162044#issuecomment-3254427869))	2025-09-04 16:06:30 +00:00
Boyuan Feng	601ae8e483	[CUDAGraph] add config to error on skipping cudagraph (#161862 ) Many users want a config that forces all CUDA ops to be captured by cudagraph. When that is not possible, pt2 should error. This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). It also adds an environment variable, `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR`, to control it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862 Approved by: https://github.com/ezyang, https://github.com/mlazos	2025-09-04 15:52:39 +00:00
PyTorch MergeBot	b7dad7dd49	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit 90b08643c3a6eb1f3265b7d1388bd76660759f46. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358))	2025-09-04 15:25:07 +00:00
Alexander Grund	e532c9d4f1	Relax tolerance for test_quick_baddbmm_cpu_complex64 (#152424 ) On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`. I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb The greatest differences are consistent and the same on both CPU architectures: ``` Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed) Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed) ``` Hence I assume this is in the expected tolerances especially as `complex128` and all other types pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424 Approved by: https://github.com/malfet	2025-09-04 13:26:42 +00:00
PyTorch MergeBot	34aa78274d	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))	2025-09-04 13:13:52 +00:00
Deng, Daisy	040d00af04	[2/N]Port several test files under test/distributed to Intel GPU (#159473 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - use requires_accelerator_dist_backend to allow both nccl and xccl test - enabled XPU for some test path - Change the hardcoded world_size according to device_count. - Unify some common code under torch/testing/_internal for multiple backend, for example: Added xpu for Backend.backend_capability and dist.Backend.register_backend() Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-04 12:53:17 +00:00
Klaus Zimmermann	9c957723a0	Replace setup.py develop with pip install -e (#156710 ) #156027 already replaced most use of `python setup.py develop`. This PR only adds a few more occurrences. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156710 Approved by: https://github.com/atalman	2025-09-04 11:07:44 +00:00
fengqing.lu	acece97c3a	[Intel GPU] Upgrade OneDNN XPU Tag to v3.9.1 (#161932 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161932 Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey	2025-09-04 11:05:10 +00:00
kbabiuchx	ea1883dfd3	Fixes #154982 : add missing to_result_dtype in vector_norm (#155111 ) Fixes #154982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155111 Approved by: https://github.com/isuruf, https://github.com/eellison	2025-09-04 10:49:08 +00:00
Shangdi Yu	d67c29ad22	[inductor] Fix int64 from MutationOutput Buffer (#162020 ) Summary: When we have a user-defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel. When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it. To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput. Test Plan: ``` buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel ``` Differential Revision: D81530083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020 Approved by: https://github.com/davidberard98, https://github.com/eellison	2025-09-04 09:47:57 +00:00
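The idea in #162020 can be sketched in plain Python: when choosing the index width, a buffer with no layout of its own should borrow a size hint from the buffers it mutates rather than be skipped. Everything here is hypothetical (the dict layout, `numel_hint`, `pick_index_dtype`), and the simple "any buffer exceeds the int32 range" test is an assumption; Inductor's real check is more involved.

```python
INT32_MAX = 2**31 - 1


def numel_hint(name, buffers):
    """Return an element-count hint for the named buffer.

    `buffers` maps a name to a dict with an optional "numel" and, for a
    mutation output, the "mutation_names" of the buffers it mutates.
    """
    buf = buffers[name]
    if buf.get("numel") is not None:
        return buf["numel"]
    # No layout of its own: fall back to the buffers it mutates.
    mutated = buf.get("mutation_names", ())
    return max((numel_hint(n, buffers) for n in mutated), default=0)


def pick_index_dtype(names, buffers):
    """Use int64 if any involved buffer may exceed the int32 index range."""
    needs_int64 = any(numel_hint(n, buffers) > INT32_MAX for n in names)
    return "int64" if needs_int64 else "int32"


buffers = {
    "big": {"numel": 2**32},
    "mut_out": {"numel": None, "mutation_names": ["big"]},
    "small": {"numel": 10},
}
with_mutation = pick_index_dtype(["mut_out", "small"], buffers)
without_mutation = pick_index_dtype(["small"], buffers)
```

Skipping `mut_out` entirely would have selected int32 for the first call, which is the failure mode the commit describes.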
vishalgoyal316	09587daf8c	Adding missing example of torch.full_like Issue#161899 (#162051 ) Fixes #161899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162051 Approved by: https://github.com/zou3519	2025-09-04 08:45:49 +00:00
Chong Gu	c024b1f5a1	[AMD] [Reland] Fix AMD User Defined Kernel Autotune (#161521 ) Summary: This is a reland of D80285441, fixed the unit test. Test Plan: ``` buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR ``` will succeed after this diff. Rollback Plan: Differential Revision: D80971224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521 Approved by: https://github.com/frank-wei	2025-09-04 08:41:18 +00:00
zeshengzong	8fd3c9ce91	Optimize AMP custom_backend_name error message (#162037 ) Print the AMP target dtype in the warning so that custom backend authors can more easily find the expected dtype during integration. ## Test Result ### Before ```python In [1]: import torch ...: import torch_openreg ...: ...: a = torch.randn(3, 4) ...: b = torch.randn(4, 2) ...: with torch.autocast("openreg", dtype=torch.float16): ...: torch.mm(a, b) ...: /home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype is not supported. Disabling autocast. openreg Autocast only supports dtypes of torch.float32 currently. warnings.warn(error_message ``` ### After ```python In [1]: import torch ...: import torch_openreg ...: ...: a = torch.randn(3, 4) ...: b = torch.randn(4, 2) ...: with torch.autocast("openreg", dtype=torch.float16): ...: torch.mm(a, b) ...: /home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype torch.float16 is not supported. Disabling autocast. openreg Autocast only supports dtypes of torch.float32 currently. warnings.warn(error_message) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162037 Approved by: https://github.com/zou3519	2025-09-04 08:27:56 +00:00
Liao, Wei	e19e02c84c	port distributed tensor test files for Intel GPU (#161604 ) In this PR, we port the test/distributed/tensor test files to Intel GPU. We enable Intel GPU with the following methods while trying to keep the original code style: - Use torch.accelerator for general GPU code - Skip cases with known issues when running on XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/161604 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-04 07:49:25 +00:00
Chris Thi	69a25f6888	[ROCm] Enable USE_FBGEMM_GENAI (#160676 ) Summary: X-link: https://github.com/pytorch/FBGEMM/pull/4703 X-link: https://github.com/facebookresearch/FBGEMM/pull/1728 In this diff we enable the support for the new FBGEMM backed FP8 _scaled_grouped_mm on ROCm. For now we only enable support for `gfx942` as that is what we have thoroughly tested performance and correctness on. Rollback Plan: Differential Revision: D79564024 Test Plan: Ensure builds with: - `USE_FBGEMM_GENAI=1` and without gfx942 - `USE_FBGEMM_GENAI=1` and with gfx942 - `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](`9491d289b3/.ci/docker/libtorch/build.sh (L48)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160676 Approved by: https://github.com/drisspg	2025-09-04 07:13:17 +00:00
tqchen	890626632d	[DLPACK] Optimize toDLPack Conversion Speed (#162111 ) In gh-83069, the toDLPack converter introduced a normalization step that changes the strides to 1 when shape[i] == 1. This step, however, calls as_strided during toDLPack and can slow toDLPack down by about 3x, raising the overhead of PyTorch's DLPack conversion from under 0.2 us to around 0.6 us per call. This PR adds a need_normalize_strides check that first confirms whether stride normalization is necessary. In the most common cases, when the tensor is contiguous, such normalization is not necessary. We confirmed that this additional check brings toDLPack back below 0.2 us and can significantly speed up eager-mode integration of DLPack with PyTorch. If normalization is needed, the older path is invoked. Fixes #162113 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162111 Approved by: https://github.com/msaroufim	2025-09-04 05:27:05 +00:00
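The fast-path idea in #162111 can be sketched without PyTorch: first check whether any size-1 dimension has a stride other than 1, and only then pay for the normalization. The real change lives in the C++ converter; the functions below are hypothetical Python stand-ins that operate on plain shape and stride tuples, and the exact condition used upstream is an assumption based on the commit text.

```python
def need_normalize_strides(shape, strides):
    """True if some dimension of size 1 carries a stride other than 1."""
    return any(dim == 1 and stride != 1 for dim, stride in zip(shape, strides))


def normalize_strides(shape, strides):
    """Slow path: force the stride to 1 wherever the dimension has size 1."""
    return tuple(1 if dim == 1 else stride for dim, stride in zip(shape, strides))


def export_strides(shape, strides):
    """Take the slow path only when the cheap check says it is needed."""
    if not need_normalize_strides(shape, strides):
        return tuple(strides)  # fast path: nothing to rewrite
    return normalize_strides(shape, strides)


fast = export_strides((2, 3), (3, 1))
slow = export_strides((1, 3), (3, 1))
```

In the second call the leading dimension has size 1 but stride 3, so the sketch falls back to the normalizing path.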
Guilherme Leobas	480c739112	Capture TypeError in `CONTAINS_OP` (#161069 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161069 Approved by: https://github.com/anijain2305	2025-09-04 04:49:09 +00:00
Gabriel Ferns	66f3b4a682	Contiguous subgraph decomposition (#161241 ) ## Summary Adds a subgraph decomposition for addmm and mm that performs well on large `K` compared to `M` and `N`, and functions well as an alternative to `split-k` on AMD (transposed only), which does not support AMD currently. ## Background On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower. For example: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176])) )) ``` is a lot slower than: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1])) )) ``` This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels. 
## Data I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups: ``` Parsed 420 unique shapes from benchmark output addmm improvements when best: addmm_16448x512x2048: +0.14% addmm_128x2048x2048: +0.01% addmm_128x768x1000: +0.75% addmm_12672x3072x768: +1.08% addmm_512x768x32000: +0.62% addmm_12608x384x384: +0.00% addmm_4160x1024x4096: +0.90% addmm_16x768x2: +0.56% addmm_12608x3072x768: +0.09% addmm_64x4096x1000: +2.77% addmm_256x1024x512: +1.99% addmm_30x256x256: +1.12% addmm_100480x128x384: +0.91% addmm_6400x2048x512: +0.25% addmm_61568x1024x256: +0.08% addmm_1x768x768: +0.93% addmm_12544x384x384: +0.19% addmm_128x512x1000: +0.77% addmm_2048x128x128: +1.32% addmm_128x3072x1000: +0.24% addmm_7936x512x2048: +0.07% addmm_8192x512x2048: +0.33% addmm_64x1024x1000: +1.43% addmm_128x2304x1000: +0.01% addmm_32768x256x2: +0.75% addmm_64x384x1152: +0.79% addmm_64x640x1000: +0.01% addmm_100480x128x128: +0.87% addmm_1152x3072x768: +1.13% addmm_8192x256x2048: +1.40% addmm_4096x128x768: +0.01% addmm_128x2560x1000: +0.01% addmm_12544x2048x512: +0.43% addmm_200704x24x96: +0.14% addmm_8448x512x2048: +0.96% addmm_50176x256x1024: +0.62% addmm_4160x4096x1024: +0.22% addmm_4096x768x768: +0.32% addmm_220x2048x512: +0.56% addmm_8x2048x1000: +1.12% addmm_256x197951x512: +26.99% addmm_401536x64x192: +0.60% addmm_2040x2048x512: +0.47% addmm_512x1024x256: +1.32% addmm_128x4096x1000: +1.67% addmm_12672x768x768: +0.34% addmm_128x368x1000: +0.77% addmm_96x1280x1000: +0.01% addmm_12544x512x2048: +0.41% addmm_6272x320x1280: +0.76% addmm_12544x3072x768: +0.09% addmm_64x384x1000: +0.39% mm improvements when best: mm_200704x128x512: +1.29% mm_663552x16x16: +0.80% mm_4096x768x768: +0.51% mm_131072x64x31: +0.24% mm_12544x1152x384: +0.11% mm_128x2048x2: +0.46% mm_262144x16x23: +0.62% mm_50176x576x192: +0.37% mm_131072x16x31: +0.26% ================================================================================ BENCHMARK ANALYSIS RESULTS 
================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 247 Average Subgraph placement: 3.38 Median Subgraph placement: 2.0 Subgraph is best choice: 52/247 shapes (21.1%) Average improvement when best: 1.15% Median improvement when best: 0.58% Largest improvement when best: +26.99% Operation: bmm ---------------------------------------- Total shapes analyzed: 85 Average Subgraph placement: 24.00 Median Subgraph placement: 21.0 Subgraph is best choice: 0/85 shapes (0.0%) Average improvement when best: N/A (never best) Median improvement when best: N/A (never best) Largest improvement when best: N/A (never best) Operation: mm ---------------------------------------- Total shapes analyzed: 88 Average Subgraph placement: 15.08 Median Subgraph placement: 4.0 Subgraph is best choice: 9/88 shapes (10.2%) Average improvement when best: 0.52% Median improvement when best: 0.46% Largest improvement when best: +1.29% ``` ## Results The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune: ``` addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436 addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702 addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834 addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105 ... 
``` Compared to the non-transposed autotune: ``` addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421 addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246 addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547 addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895 addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916 ``` It seems to perform really well for high values of `K` vs `N` and `M`. Testing this hypothesis with some custom shapes: ``` Parsed 64 unique shapes from benchmark output addmm improvements when best: addmm_128x16384x128: +0.18% addmm_128x262144x256: +38.24% addmm_128x200000x512: +14.76% addmm_256x800000x128: +0.06% addmm_131072x128x256: +0.27% addmm_128x256x131072: +0.25% addmm_2048x200000x64: +12.45% mm improvements when best: mm_128x16384x128: +0.18% mm_128x262144x256: +38.05% mm_128x200000x512: +9.47% mm_256x800000x128: +0.99% mm_512x6400000x256: +3.17% mm_524288x64x64: +0.29% mm_2048x200000x64: +11.19% mm_8192x1000000x256: +34.14% mm_128x4096x100000: +0.40% mm_128x3072x150000: +0.27% ================================================================================ BENCHMARK ANALYSIS RESULTS ================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 33 Average Subgraph placement: 4.39 Median Subgraph placement: 2.0 Subgraph is best choice: 7/33 shapes (21.2%) Average improvement when best: 9.46% Median improvement when best: 0.27% Largest improvement when best: +38.24% Operation: mm ---------------------------------------- Total shapes analyzed: 30 Average Subgraph placement: 7.63 Median Subgraph placement: 2.0 Subgraph is best choice: 10/30 shapes (33.3%) Average improvement when best: 9.81% Median 
improvement when best: 2.08% Largest improvement when best: +38.05% ``` ## Conclusion Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low `M`, low `N`, and high `K` shapes. Data gathering scripts: https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866 ## Test Plan: New unit tests. Differential Revision: D80771648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241 Approved by: https://github.com/eellison	2025-09-04 04:43:58 +00:00
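As a rough illustration of the layout difference described in the Background section of the commit above, the sketch below checks whether a size/stride pair is row-major contiguous. It is plain Python and does not use PyTorch or Inductor; the helper names `contiguous_strides` and `is_row_major_contiguous` are hypothetical, and the check is a simplification (PyTorch's own contiguity test also ignores the strides of size-1 dimensions). The shapes are the ones quoted in the commit message.

```python
def contiguous_strides(size):
    """Return the row-major (C-contiguous) strides for a given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def is_row_major_contiguous(size, stride):
    """Simplified check: strides must equal the row-major strides exactly."""
    return list(stride) == contiguous_strides(size)


# B as it appears in the slow case: a transposed (column-major) view.
b_transposed = ([178176, 6144], [1, 178176])
# B as it appears in the fast case: a row-major contiguous buffer.
b_contiguous = ([178176, 6144], [6144, 1])

print(is_row_major_contiguous(*b_transposed))  # slow layout
print(is_row_major_contiguous(*b_contiguous))  # fast layout
```

The decomposition in the commit is benchmarked as one more autotune choice: copying B into the second layout before the matmul only wins when the copy costs less than the time saved in the kernel.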
PyTorch UpdateBot	302df2ac5d	[vllm hash update] update the pinned vllm hash (#162115 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162115 Approved by: https://github.com/pytorchbot	2025-09-04 04:26:34 +00:00
Shangdi Yu	dec72ea4b0	[reland] Add inductor provenance mapping for cpp extern kernel (#161656 ) (#162069 ) Summary: Add inductor provenance mapping for cpp extern kernel Test Plan: ``` buck run fbcode//caffe2/test/inductor:provenance_tracing -- -r test_cpu_extern_kernel ``` Differential Revision: D81598857 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162069 Approved by: https://github.com/angelayi	2025-09-04 04:18:43 +00:00
Richard Howell	8975cda252	[pt] strip error messages in profile builds (#162076 ) Summary: Profile builds should match production builds, and error messages result in large static initializers running. Omit them for profile builds too. Test Plan: Before: ``` $ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output $ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a \| grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9 0000000000003234 T __ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9_ ``` After: ``` $ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output $ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a \| grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9 ``` Rollback Plan: Reviewed By: yury-dymov, abashyam Differential Revision: D81599582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162076 Approved by: https://github.com/swolchok	2025-09-04 04:18:27 +00:00
Guilherme Leobas	d636c181f9	Fix `range.__getitem__()` (#161804 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161804 Approved by: https://github.com/anijain2305 ghstack dependencies: #161801, #161802, #161803	2025-09-04 02:33:03 +00:00
Guilherme Leobas	c8255c67cd	redirect `iter(range)` to `range.__iter__()` (#161803 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161803 Approved by: https://github.com/anijain2305 ghstack dependencies: #161801, #161802	2025-09-04 02:33:03 +00:00
Guilherme Leobas	485a7bd82e	Add `range_count` and `range.__contains__` (#161802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161802 Approved by: https://github.com/anijain2305 ghstack dependencies: #161801	2025-09-04 02:33:03 +00:00
Guilherme Leobas	1ef7efa592	Add `range_equals` (#161801 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161801 Approved by: https://github.com/anijain2305	2025-09-04 02:33:03 +00:00
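The four `range` commits above concern how Dynamo traces operations on the built-in `range`. For reference, the sketch below shows the eager CPython behavior that such tracing would need to reproduce. It only exercises the public built-in; `range_equals` and `range_count` are names taken from the commit titles and are not called here.

```python
r = range(0, 10, 2)  # 0, 2, 4, 6, 8

# range.__getitem__ accepts both integers and slices; a slice yields a range.
last = r[-1]
sliced = r[1:4]

# range.count and range.__contains__ are computed arithmetically for ints.
count_present = r.count(4)
count_absent = r.count(5)
has_six = 6 in r
has_seven = 7 in r

# Equality compares the produced sequences, not the constructor arguments.
same_sequence = range(0, 3, 2) == range(0, 4, 2)  # both produce 0, 2
both_empty = range(0) == range(2, 1)

# iter(r) and r.__iter__() walk the same values.
via_iter = list(iter(r))
via_dunder = list(r.__iter__())
```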
Sun, Jiayi	57278d45f0	[Quant][Inductor][CPU] add qconv int8-mixed-bf16 patterns (#161487 ) Summary: Expand the patterns supported by qconv weight prepack, Specifically, expand the conv patterns of int8-mixed-bf16 datatype to support the following two cases: Case 1: the `out_dtype `of `dequantize_per_tensor `is `torch.float32` ``` dq_per_tensor dq_per_channel \| \| to_bf16 to_bf16 \ / Conv2d ``` Case 2: the `out_dtype `of `dequantize_per_tensor `is `torch.bfloat16` ``` dq_per_tensor dq_per_channel \ \| to_bf16 / Conv2d ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161487 Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel ghstack dependencies: #161486	2025-09-04 02:01:34 +00:00
Sun, Jiayi	cec0ff1228	[Quant][Inductor][CPU] add qlinear int8-mixed-bf16 patterns (#161486 ) Summary: Expand the patterns supported by qlinear weight prepack, Specifically, expand the linear patterns of int8-mixed-bf16 datatype to support the following two cases: Case 1: the `out_dtype` of `dequantize_per_tensor ` is `torch.float32` dq_per_tensor dq_per_channel \| \| to_bf16 to_bf16 \| \| OPT(reshape) permute \ / addmm/mm \| OPT(reshape) or dq_per_tensor dq_per_channel \| \| to_bf16 to_bf16 \| \| expand permute \ \| expand / bmm \| OPT(add) Case 2: the `out_dtype` of `dequantize_per_tensor ` is `torch.bfloat16` dq_per_tensor dq_per_channel \| \| to_bf16 \| OPT(reshape) permute \ / addmm/mm \| OPT(reshape) or dq_per_tensor dq_per_channel \| \| to_bf16 \| expand permute \ \| expand / bmm \| OPT(add) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161486 Approved by: https://github.com/Xia-Weiwen, https://github.com/jansel	2025-09-04 01:53:02 +00:00
Jacob Szwejbka	65985937d9	expose number of outputs in native runtime for unified runtime (#161723 ) This is only user outputs which is what we want. Spoke to @zhxchen17 though and it seems like nativeRT might have some bugs on propagating updates to things like input mutation or buffer mutation though. Something to take a look at in a follow up. Also I have no idea where the nativeRT tests are. Any pointers, @zhxchen17 @SherlockNoMad? Pull Request resolved: https://github.com/pytorch/pytorch/pull/161723 Approved by: https://github.com/zhxchen17	2025-09-04 01:20:31 +00:00
Laith Sakka	fbf3d2027d	use sym_or instead of any to avoid dde in calc_conv_nd_return_shape (#162084 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162084 Approved by: https://github.com/aorenste Co-authored-by: Aaron Orenstein <aorenste@fb.com>	2025-09-04 01:20:22 +00:00
William Wen	8678d831c4	[dynamo] rename set_fullgraph to error_on_graph_break (#161739 ) Renaming `set_fullgraph` to `error_on_graph_break` for now. There are no semantic differences yet. In a followup PR, we will introduce a new `torch.compile` option `error_on_graph_break` that has lower priority than `fullgraph` so that `fullgraph` really returns 1 graph. I could keep `set_fullgraph` as a deprecated alias for `error_on_graph_break` for now, but I'm hoping that won't be necessary since it's still private API (there are no internal callsites yet, and there are no significant OSS callsites yet). cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos @guilhermeleobas @xmfan as primary users for `set_fullgraph` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161739 Approved by: https://github.com/xmfan, https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/mlazos	2025-09-04 01:15:06 +00:00
Saurabh Mishra	1281470155	[DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682 ) This PR introduces the QuantizedHuggingFaceReader component which enables the reading and dequantization of the quantized tensors in the SafeTensors checkpoint. The following capabilities are introduced: - Configuration of the target DType and the block size. - Multi-threaded dequantization for efficiency Test Plan: buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage ``` Time elapsed: 2:34.1s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D80174674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682 Approved by: https://github.com/ankitageorge	2025-09-04 01:09:53 +00:00
Markus Hoehnerbach	9458d1ac3b	[inductor] pdl inductor option (disabled by default) (#160928 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928 Approved by: https://github.com/eellison	2025-09-04 00:35:23 +00:00
Avik Chaudhuri	3c45af079a	kill allow_complex_guards_as_runtime_asserts (#161794 ) Summary: [reland] Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept). Test Plan: updated tests Rollback Plan: Differential Revision: D81334984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794 Approved by: https://github.com/zhxchen17	2025-09-04 00:17:01 +00:00
PyTorch MergeBot	aad96a2022	Revert "Contiguous subgraph decomposition (#161241 )" This reverts commit d64718503728001a1e78168fd7f2d4ff23e57285. Reverted https://github.com/pytorch/pytorch/pull/161241 on behalf of https://github.com/jeffdaily due to breaks rocm mi300 tests ([comment](https://github.com/pytorch/pytorch/pull/161241#issuecomment-3251185098))	2025-09-04 00:14:22 +00:00
Rohit Manav	5f3cbc9442	fixed typo error (#162055 ) Fixes #162054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162055 Approved by: https://github.com/RajeshvShiyal, https://github.com/malfet	2025-09-04 00:06:58 +00:00
Xu Han	a918bbad6a	[inductor] fix test output path 2 (#162085 ) Fix test_output_path_2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162085 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-09-04 00:03:47 +00:00
dolpm	8ec551bb35	[aot-compile] strip internal tracebacks for non-verbose graph breaks + include user file/lineno (#162005 ) pytest test/dynamo/test_aot_compile.py -k test_aot_compile_graph_break_error_fmt before ``` Traceback (most recent call last): File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module> aot_compiled_fn = compiled.aot_compile((example_inputs, {})) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 717, in aot_compile return aot_compile_fullgraph( ^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/aot_compile.py", line 132, in aot_compile_fullgraph capture_output = convert_frame.fullgraph_capture( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 947, in fullgraph_capture dynamo_output = compile_frame( ^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 1020, in compile_frame bytecode, tracer_output = transform_code_object(code, transform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/bytecode_transformation.py", line 1592, in transform_code_object tracer_output = transformations(instructions, code_options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 992, in transform tracer_output = trace_frame( ^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 312, in _fn return fn(args, kwargs) ^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 821, in trace_frame run_tracer() File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 803, in run_tracer tracer.run() File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1472, in run while self.step(): ^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1342, in step 
self.dispatch_table[inst.opcode](self, inst) File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 902, in wrapper return inner_fn(self, inst) ^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3364, in CALL self._call(inst) File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3358, in _call self.call_function(fn, args, kwargs) File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1260, in call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward return getattr(self.realize(), name)(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/variables/functions.py", line 1513, in call_function unimplemented_v2( File "/data/users/$USER/pytorch/torch/_dynamo/exc.py", line 596, in unimplemented_v2 raise Unsupported(msg) torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()` Explanation: User-inserted graph break. Message: None Hint: Remove the `torch._dynamo.graph_break()` call. Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}` For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html ``` after ``` Traceback (most recent call last): File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module> aot_compiled_fn = compiled.aot_compile((example_inputs, {})) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 737, in aot_compile raise e.with_traceback(None) from e.__cause__ # User compiler error ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()` Explanation: User-inserted graph break. 
Message: None Hint: Remove the `torch._dynamo.graph_break()` call. Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}` For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html from user code: File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo torch._dynamo.graph_break() Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" ``` consistent w/ std torch.compile ``` Traceback (most recent call last): File "/data/users/$USER/vllm-tests/graph-break.py", line 16, in <module> res = compiled(example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 850, in compile_wrapper raise e.with_traceback(None) from e.__cause__ # User compiler error ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()` Explanation: User-inserted graph break. Message: None Hint: Remove the `torch._dynamo.graph_break()` call. Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}` For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html from user code: File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo torch._dynamo.graph_break() Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162005 Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan	2025-09-03 23:19:47 +00:00
Catherine Lee	36d207fcaa	[CI] viable strict upgrade: Explicitly name which linux binary wheels should block (#162100 ) Reason: rocm binary builds should not block viable strict upgrade. It is queuing/canceled so viable strict is 1.2 days old. Tested by mangling the workflow file to get to the actual call of the python script `python ../test-infra/tools/scripts/fetch_latest_green_commit.py --required-checks '["pull", "trunk", "lint", "^linux-binary-manywheel$", "^linux-binary-libtorch-release$", "linux-aarch64"]' --viable-strict-branch viable/strict --main-branch master`, which I then ran locally where I have credentials. It returned d64718503728001a1e78168fd7f2d4ff23e57285 which is green. Without this change, it returns 5e5870e858f60ff4bf87d03f3592097e934a9580, which is pretty old. The other solution would have been to mark it as unstable, I think. Side note: why is it master, and how is it working like that? Pull Request resolved: https://github.com/pytorch/pytorch/pull/162100 Approved by: https://github.com/huydhn	2025-09-03 22:38:32 +00:00
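The effect of anchoring the check names in the commit above can be sketched as follows. This assumes the required-check strings are treated as regular expressions searched against workflow names; that is an assumption about `fetch_latest_green_commit.py`, not something the commit message states. The function `blocking_workflows` and the workflow names in the list are hypothetical, chosen only to illustrate the difference.

```python
import re


def blocking_workflows(required_checks, workflow_names):
    """Return the workflow names matched by at least one required-check pattern."""
    patterns = [re.compile(check) for check in required_checks]
    return [
        name
        for name in workflow_names
        if any(pattern.search(name) for pattern in patterns)
    ]


# Hypothetical workflow names, for illustration only.
names = [
    "pull",
    "linux-binary-manywheel",
    "linux-binary-manywheel-rocm",
    "linux-binary-libtorch-release",
]

# An unanchored pattern also picks up the ROCm variant.
loose = blocking_workflows(["linux-binary"], names)

# Anchored patterns, as in the commit, match only the exact names.
strict = blocking_workflows(
    ["^linux-binary-manywheel$", "^linux-binary-libtorch-release$"], names
)
```

Under this assumption, a queued or canceled ROCm binary job would no longer keep a commit from being considered green, which matches the stated motivation.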
Jeff Daily	99f356fa58	[ROCm] revamp miopen integration (#161687 ) Update sources under ATen/miopen and ATen/native/miopen to align with best practices. Avoid reshape_ calls inside backward operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161687 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-03 22:28:09 +00:00
Jithun Nair	0af70e2353	Modify ROCm MI2xx-based workflows to run on cron schedule (#162103 ) To mitigate queueing on MI2xx runners since Cirrascale runners are offline. Match cron schedule of periodic.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/162103 Approved by: https://github.com/jeffdaily, https://github.com/seemethere	2025-09-03 21:51:03 +00:00
Jeff Daily	b1bb98ddeb	[ROCm] TunableOp should use HIP version, not ROCm version (#162067 ) Fixes #160874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162067 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-03 21:42:23 +00:00
Howard Huang	abc447174c	[PP] Add profiling to schedule execution (#160753 ) Profiling title will be `str(action)` <img width="1545" height="694" alt="image" src="https://github.com/user-attachments/assets/60b3506b-b8d6-4ae0-8b32-0d51d45fa2f0" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160753 Approved by: https://github.com/wconstab	2025-09-03 21:31:50 +00:00
Arsh Zahed	734ce8eba9	Rename propagate_tensor_meta to make private again (#161744 ) Rename the wrapper `propagate_tensor_meta` added in #161334 to make it clearly private, and rename the existing LRU function to accommodate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161744 Approved by: https://github.com/bdhirsh	2025-09-03 21:11:45 +00:00
Xinya Zhang	98efc9e93d	[ROCm] Bump AOTriton to 0.11b (#161754 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b: * Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements - AITER ASM kernels deliver over 500TFLOPS training performance. See [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more details. * Now returns natural-base `logsumexp` tensor, matching CUDA's behavior - PR #156903 is reverted in this PR as well since it is not needed anymore. * Enables `CausalVariant.LOWER_RIGHT` The build system changes drastically along with the new packaging scheme of AOTriton 0.11 * AOTriton 0.11 packs GPU images separately from AOTriton runtime * `aotriton.cmake` now selectively downloads image packs according to `PYTORCH_ROCM_ARCH` * `aotriton.cmake` now only uses the pre-compiled runtime library that exactly matches the ROCM in the build environment. For PyTorch builds with ROCm versions not listed in the file, the build process will build AOTriton runtime without GPU images from source - This avoids any further ABI breaks like ROCM 6.4 -> 7.0 - recursive git clone is disabled since building AOTriton runtime does not require submodules. Bug fixes: * Fix a kernel bug introduced when implementing SWA Known Problems: * gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status due to accuracy issues. Triton compiler fixes are needed to restore the support status. * Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0. This issue is under investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-09-03 20:45:44 +00:00
Ke Wen	994f2a5dbc	[SymmMem][CI] Make sure group names are consistent (#162035 ) Unblocking #161741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162035 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-09-03 20:40:24 +00:00
Natalia Gimelshein	d1706d9128	[Symmetric memory] set handle type for ROCm (#161741 ) Fixes #161722 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161741 Approved by: https://github.com/kwen2501	2025-09-03 20:33:35 +00:00
arkadip-maitra	1aa7476885	fix to segmentation fault when empty tensor is passed to choose_qpara… (#161966 ) …ms_optimized Fixes #153326 Minimal code to reproduce error: ``` import torch tensor = torch.tensor([]) torch.choose_qparams_optimized( tensor, 0, 200, 0.16, 8 ) ``` Previous Output: `Segmentation fault` Now Output: ``` Traceback (most recent call last): File "/home/amaitra/work/tests/issue_153326.py", line 5, in <module> torch.choose_qparams_optimized( RuntimeError: input tensor is empty and has no data ``` Caused because `const float* input_row =input_tensor.const_data_ptr<float>();` becomes null Pull Request resolved: https://github.com/pytorch/pytorch/pull/161966 Approved by: https://github.com/Skylion007	2025-09-03 20:26:26 +00:00
Aaryaman Vasishta	8e23a1227b	[ROCm/Windows] Fix build failures and support some BLAS calls (#161981 ) * Support getrsBatched/geqrfBatched/gelsBatched on Windows ROCm (fixes https://github.com/ROCm/TheRock/issues/1367) * Fix windows pytorch build with USE_DISTRIBUTED=ON by default Pull Request resolved: https://github.com/pytorch/pytorch/pull/161981 Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-03 20:26:14 +00:00
Yulun Wang	850e1382a9	[hipify] Replace cudaStreamCaptureStatusNone (#161992 ) Replacing additional cuda symbols with hip symbols Differential Revision: D81420086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161992 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007	2025-09-03 20:23:32 +00:00
Ke Wen	3c0ff1b569	[SymmMem] Add root argument to broadcast op (#161090 ) It was missing earlier. Also added range check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090 Approved by: https://github.com/fegin	2025-09-03 20:17:45 +00:00
Yiming Zhou	c465b3d52c	[2/n][export] Refactor PT2 Archive weight saving and loading (#161520 ) Summary: The saving (serialization) part of PT2 archive weight refactoring. The loading (deserialization) part has landed in D80035490 Test Plan: CI Rollback Plan: Differential Revision: D80970931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161520 Approved by: https://github.com/SherlockNoMad	2025-09-03 20:12:49 +00:00
andrewor14	f4c33cd44a	[pt2e] Avoid getting model device once per node (#159901 ) Summary: Previously, we called `assert_and_get_unique_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we should just call this once in the beginning and reuse the same model device across all the nodes. Test Plan: python test/test_quantization.py -k TestQuantizePT2E Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901 Approved by: https://github.com/jerryzh168	2025-09-03 19:29:00 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	92576a594b	Prototype for building non-strict leak detector (#160456 ) Summary: Our strategy for detecting fake tensor leakage in non-strict for outside scope (side effects happening outside of model.forward) is: 1. We do gc.collect() before export and get the alive fake tensors 2. We dump the proxy to fake tensor map from make_fx tracer 3. We query gc again to get alive fake tensors 4. We take the delta between (1) and (3) 5. Filter out fake tensors that are: 1. Associated with `TrackedFake` (input tracking thing in symbolic_shapes) 2. Associated with `gm.meta` 6. Do ID match with the proxies and emit their stacktraces. We rely on (https://github.com/pytorch/pytorch/pull/159923) for other sources of leakages such as: 1. We failed to proxy an operator (like param.data) 2. We cache some tensor in model.forward (https://github.com/pytorch/pytorch/issues/155114) In general, we notice that `gc.collect()` and querying gc for live objects are fairly slow, so we turn on this feature only under an env variable. We should document in the public-facing export documentation that users who run into weird errors regarding fake tensors should look into turning on this env variable for further analysis. Test Plan: Test plan Rollback Plan: Differential Revision: D80003204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160456 Approved by: https://github.com/pianpwk	2025-09-03 19:21:27 +00:00
Jithun Nair	cd529b686d	[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044 ) ### Motivation * MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs: <img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" /> * MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755). Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity. * However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa * Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR. ### TODOs (cc @amdfaa): * Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step * Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-03 18:34:07 +00:00
Isuru Fernando	62c3f9a97f	[inductor] Follow integer overflow rules in TypedExpr (#161922 ) Fixes https://github.com/pytorch/pytorch/issues/161763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161922 Approved by: https://github.com/jansel	2025-09-03 18:33:18 +00:00
Guilherme Leobas	8076a185c8	Offload set method execution to CPython when possible (#160763 ) Reduces CPython `test_set.py` runtime from 63.477s to 40.298s Pull Request resolved: https://github.com/pytorch/pytorch/pull/160763 Approved by: https://github.com/anijain2305	2025-09-03 18:26:05 +00:00
Ruben Rodriguez Buchillon	f00445b43e	[inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339 ) # why - some templates e.g. scale_mm need to unsqueeze/squeeze the nodes for codegen and heuristics - unified place where we can just adjust them for the template # what - inside get_mm_configs, return not the passed in kernel inputs, but allow the template heuristic to adjust them if necessary - the default implementation right now just passes them back this diff just adds the functionality, but does not exercise it other than the default (passthrough) # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338	2025-09-03 18:23:22 +00:00
Laith Sakka	3559c354ce	stop suggesting using guard_size_oblivious on data dependent errors (#160510 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160510 Approved by: https://github.com/ezyang	2025-09-03 18:07:59 +00:00
Aleksei Nikiforov	71992dd805	S390x: build nightly binaries for new pythons (#161920 ) Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes #161515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161920 Approved by: https://github.com/malfet	2025-09-03 17:38:38 +00:00
Gabriel Ferns	d647185037	Contiguous subgraph decomposition (#161241 ) ## Summary Adds a subgraph decomposition for addmm and mm that performs well on large `K` compared to `M` and `N`, and functions well as an alternative to `split-k` on AMD (transposed only), which does not support AMD currently. ## Background On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower. For example: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176])) )) ``` is a lot slower than: ``` args[0]: TensorBox(StorageBox( InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1])) )) args[1]: TensorBox(StorageBox( InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1])) )) ``` This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels. 
## Data I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups: ``` Parsed 420 unique shapes from benchmark output addmm improvements when best: addmm_16448x512x2048: +0.14% addmm_128x2048x2048: +0.01% addmm_128x768x1000: +0.75% addmm_12672x3072x768: +1.08% addmm_512x768x32000: +0.62% addmm_12608x384x384: +0.00% addmm_4160x1024x4096: +0.90% addmm_16x768x2: +0.56% addmm_12608x3072x768: +0.09% addmm_64x4096x1000: +2.77% addmm_256x1024x512: +1.99% addmm_30x256x256: +1.12% addmm_100480x128x384: +0.91% addmm_6400x2048x512: +0.25% addmm_61568x1024x256: +0.08% addmm_1x768x768: +0.93% addmm_12544x384x384: +0.19% addmm_128x512x1000: +0.77% addmm_2048x128x128: +1.32% addmm_128x3072x1000: +0.24% addmm_7936x512x2048: +0.07% addmm_8192x512x2048: +0.33% addmm_64x1024x1000: +1.43% addmm_128x2304x1000: +0.01% addmm_32768x256x2: +0.75% addmm_64x384x1152: +0.79% addmm_64x640x1000: +0.01% addmm_100480x128x128: +0.87% addmm_1152x3072x768: +1.13% addmm_8192x256x2048: +1.40% addmm_4096x128x768: +0.01% addmm_128x2560x1000: +0.01% addmm_12544x2048x512: +0.43% addmm_200704x24x96: +0.14% addmm_8448x512x2048: +0.96% addmm_50176x256x1024: +0.62% addmm_4160x4096x1024: +0.22% addmm_4096x768x768: +0.32% addmm_220x2048x512: +0.56% addmm_8x2048x1000: +1.12% addmm_256x197951x512: +26.99% addmm_401536x64x192: +0.60% addmm_2040x2048x512: +0.47% addmm_512x1024x256: +1.32% addmm_128x4096x1000: +1.67% addmm_12672x768x768: +0.34% addmm_128x368x1000: +0.77% addmm_96x1280x1000: +0.01% addmm_12544x512x2048: +0.41% addmm_6272x320x1280: +0.76% addmm_12544x3072x768: +0.09% addmm_64x384x1000: +0.39% mm improvements when best: mm_200704x128x512: +1.29% mm_663552x16x16: +0.80% mm_4096x768x768: +0.51% mm_131072x64x31: +0.24% mm_12544x1152x384: +0.11% mm_128x2048x2: +0.46% mm_262144x16x23: +0.62% mm_50176x576x192: +0.37% mm_131072x16x31: +0.26% ================================================================================ BENCHMARK ANALYSIS RESULTS 
================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 247 Average Subgraph placement: 3.38 Median Subgraph placement: 2.0 Subgraph is best choice: 52/247 shapes (21.1%) Average improvement when best: 1.15% Median improvement when best: 0.58% Largest improvement when best: +26.99% Operation: bmm ---------------------------------------- Total shapes analyzed: 85 Average Subgraph placement: 24.00 Median Subgraph placement: 21.0 Subgraph is best choice: 0/85 shapes (0.0%) Average improvement when best: N/A (never best) Median improvement when best: N/A (never best) Largest improvement when best: N/A (never best) Operation: mm ---------------------------------------- Total shapes analyzed: 88 Average Subgraph placement: 15.08 Median Subgraph placement: 4.0 Subgraph is best choice: 9/88 shapes (10.2%) Average improvement when best: 0.52% Median improvement when best: 0.46% Largest improvement when best: +1.29% ``` ## Results The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune: ``` addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436 addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702 addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834 addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105 ... 
``` Compared to the non-transposed autotune: ``` addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421 addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145 addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803 addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684 addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246 addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547 addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895 addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916 ``` It seems to perform really well for high values of `K` vs `N` and `M`. Testing this hypothesis with some custom shapes: ``` Parsed 64 unique shapes from benchmark output addmm improvements when best: addmm_128x16384x128: +0.18% addmm_128x262144x256: +38.24% addmm_128x200000x512: +14.76% addmm_256x800000x128: +0.06% addmm_131072x128x256: +0.27% addmm_128x256x131072: +0.25% addmm_2048x200000x64: +12.45% mm improvements when best: mm_128x16384x128: +0.18% mm_128x262144x256: +38.05% mm_128x200000x512: +9.47% mm_256x800000x128: +0.99% mm_512x6400000x256: +3.17% mm_524288x64x64: +0.29% mm_2048x200000x64: +11.19% mm_8192x1000000x256: +34.14% mm_128x4096x100000: +0.40% mm_128x3072x150000: +0.27% ================================================================================ BENCHMARK ANALYSIS RESULTS ================================================================================ Operation: addmm ---------------------------------------- Total shapes analyzed: 33 Average Subgraph placement: 4.39 Median Subgraph placement: 2.0 Subgraph is best choice: 7/33 shapes (21.2%) Average improvement when best: 9.46% Median improvement when best: 0.27% Largest improvement when best: +38.24% Operation: mm ---------------------------------------- Total shapes analyzed: 30 Average Subgraph placement: 7.63 Median Subgraph placement: 2.0 Subgraph is best choice: 10/30 shapes (33.3%) Average improvement when best: 9.81% Median 
improvement when best: 2.08% Largest improvement when best: +38.05% ``` ## Conclusion Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low `M`, low `N`, and high `K` shapes. Data gathering scripts: https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866 ## Test Plan: New unit tests. Differential Revision: D80771648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241 Approved by: https://github.com/eellison	2025-09-03 17:02:59 +00:00
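The two layouts quoted in the Background of #161241 differ only in the strides of `B`. A minimal sketch, in plain Python rather than PyTorch, of how row-major contiguity can be read off sizes and strides. The helper name `is_row_major_contiguous` is made up for illustration and is not an Inductor API; only the sizes and strides are taken from the commit message.

```python
def is_row_major_contiguous(size, stride):
    """Return True if `stride` is the dense row-major stride for `size`."""
    expected = 1
    # Walk dimensions from innermost to outermost.
    for dim, st in zip(reversed(size), reversed(stride)):
        # The stride of a size-1 dimension does not affect the layout.
        if dim != 1 and st != expected:
            return False
        expected *= dim
    return True


# Sizes and strides of args[1] (`B`) as quoted in the commit message.
b_size = [178176, 6144]
slow_b = is_row_major_contiguous(b_size, [1, 178176])  # transposed layout
fast_b = is_row_major_contiguous(b_size, [6144, 1])    # contiguous layout
```

Here `slow_b` is `False` and `fast_b` is `True`, matching the slower and faster cases described above.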
Guilherme Leobas	eb18d32bda	Add `range_iterator` (#161800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161800 Approved by: https://github.com/anijain2305 ghstack dependencies: #161799	2025-09-03 16:55:04 +00:00
Guilherme Leobas	889f01eb73	Add CPython test `test_range` (#161799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161799 Approved by: https://github.com/anijain2305	2025-09-03 16:55:04 +00:00
Xu Han	451ed93156	[inductor] fix split_aot_inductor_output_path on Windows. (#162058 ) fix split_aot_inductor_output_path on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162058 Approved by: https://github.com/angelayi	2025-09-03 16:53:38 +00:00
nandesuka	9491d289b3	Support generic dynamic shape with padding (#160997 ) Summary: Inductor has the following configurations: config.comprehensive_padding config.padding_alignment_bytes config.padding_stride_threshold In the static shape case, enabling these three options makes Inductor generate code for Flexible layout tensors that tries to pad every stride dimension up to a multiple of config.padding_alignment_bytes, for strides above config.padding_stride_threshold. When dynamic shapes are enabled, no padding is done today. This PR introduces the following configuration, which lets the user request a padded stride even for dynamic shape operations. It is opt-in so that we don't break the previous behaviour of not padding dynamic shape use cases. config.padding_stride_threshold does not apply, since the values of the strides are dynamic. config.pad_dynamic_shapes In addition, a new mode "python_slow" has been added to launch grid calculation, which achieves the same ceildiv behaviour that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic. Test Plan: CI Rollback Plan: Differential Revision: D80468808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel	2025-09-03 15:58:18 +00:00
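The padding rule in #160997 amounts to rounding a stride up to a multiple of the alignment, which is a ceiling division. A minimal sketch in plain Python, not Inductor's implementation: the function names `ceildiv` and `pad_stride`, the 128-byte alignment, and the 2-byte item size are assumptions chosen for illustration, and the alignment is assumed to be a multiple of the item size.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division on integers, without going through floats."""
    return -(-a // b)


def pad_stride(stride: int, itemsize: int, alignment_bytes: int) -> int:
    """Round `stride` (in elements) up so that stride * itemsize is a
    multiple of `alignment_bytes`."""
    align_elems = max(alignment_bytes // itemsize, 1)
    return ceildiv(stride, align_elems) * align_elems


# Hypothetical values: fp16 elements (2 bytes) and a 128-byte alignment.
padded = pad_stride(1000, itemsize=2, alignment_bytes=128)
```

With these values the alignment is 64 elements, so a stride of 1000 is padded to 1024.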
Liao, Wei	c157cf6488	port distributed tensor parallel test files for Intel GPU (#161261 ) In this PR, we port 4 test files from test/distributed/parallel and 1 test file from test/distributed/debug to Intel GPU. We enable Intel GPU with the following methods while keeping the original code style as far as possible: 1. Use torch.accelerator for general GPU support 2. Skip a case when running on XPU if it has known issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/161261 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-03 15:03:32 +00:00
PyTorch MergeBot	bb950284c7	Revert "[inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339 )" This reverts commit 90f50f7e68e120d9574e6e3189e37b4280010ad9. Reverted https://github.com/pytorch/pytorch/pull/161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](https://github.com/pytorch/pytorch/pull/161339#issuecomment-3249600885))	2025-09-03 14:56:02 +00:00
PyTorch MergeBot	f27985b7e7	Revert "[CUDAGraph] add config to error on skipping cudagraph (#161862 )" This reverts commit 204697f0e695d82894c5010fbec664c4391f90cc. Reverted https://github.com/pytorch/pytorch/pull/161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](https://github.com/pytorch/pytorch/pull/161862#issuecomment-3249582583))	2025-09-03 14:50:44 +00:00
PyTorch MergeBot	0cd6c56bdf	Revert "test: ensure editable cached wrapper is respected (#160943 )" This reverts commit bbedc71fd3267c639c38b4ec25eaa22f973d9c4d. Reverted https://github.com/pytorch/pytorch/pull/160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](https://github.com/pytorch/pytorch/pull/160943#issuecomment-3249565671))	2025-09-03 14:46:35 +00:00
Nikita Shulga	b40d9432be	[BE] Cleanup stale comments/copy from `gemm` (#162001 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001 Approved by: https://github.com/drisspg ghstack dependencies: #161999	2025-09-03 14:31:09 +00:00
Nikita Shulga	02c83f1334	[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999 Approved by: https://github.com/drisspg	2025-09-03 14:31:08 +00:00
Nikhil Patel	aed33a8fcb	[Inductor][Tritonparse] Get Inductor kernel params (#161953 ) Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse. Reviewed By: davidberard98, FindHao Differential Revision: D80994791 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161953 Approved by: https://github.com/FindHao	2025-09-03 14:11:27 +00:00
Huamin Li	b16d3f4c8c	[AOTI] Fix a bug from load_constants (#161887 ) Summary: we have ``` std::vector<size_t> constants_internal_offset( num_constants - num_folded_constants); ``` but the for loop does not consider it ``` for (size_t i = 0; i < num_constants; i++) { ... constants_internal_offset[i] ... ``` even in the for loop, it does ``` bool from_folded = this->constant_from_folded(i); if (from_folded) { continue; } ``` but `i` could still be wrong Rollback Plan: Differential Revision: D81425007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887 Approved by: https://github.com/angelayi	2025-09-03 07:45:16 +00:00
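The indexing problem described in #161887 can be modelled in a few lines. This is a toy Python model, not the AOTI C++ code and not necessarily the fix that landed: the function names, the `folded` set, and the `i * 64` offsets are all hypothetical. It shows why an array sized for the non-folded constants needs its own compact index rather than the loop index `i`.

```python
def build_offsets_buggy(num_constants, folded):
    """Index the smaller array with the loop index `i` (the reported bug)."""
    offsets = [0] * (num_constants - len(folded))
    for i in range(num_constants):
        if i in folded:
            continue
        offsets[i] = i * 64  # `i` can run past the end of `offsets`
    return offsets


def build_offsets_fixed(num_constants, folded):
    """Use a separate compact index for the non-folded constants."""
    offsets = [0] * (num_constants - len(folded))
    slot = 0
    for i in range(num_constants):
        if i in folded:
            continue
        offsets[slot] = i * 64  # hypothetical offset value
        slot += 1
    return offsets


fixed = build_offsets_fixed(5, {1, 3})
try:
    build_offsets_buggy(5, {1, 3})
    buggy_failed = False
except IndexError:
    buggy_failed = True
```

In Python the out-of-range write raises `IndexError`; in C++ the equivalent `operator[]` access on a `std::vector` is undefined behaviour instead.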
Edward Z. Yang	4ae57d448c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-03 07:33:55 +00:00
Edward Yang	90b08643c3	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-03 07:33:55 +00:00
Scott Wolchok	b0a3e58dd7	Add inline fast paths for SymInt operators (#161586 ) If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. Observed time spent in SymInt functions in computeStorageNBytes to drop (and not cost shift elsewhere in the function) after this change, profiling detach() using code similar to the benchmark from #160580 and Linux perf. Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586 Approved by: https://github.com/ezyang ghstack dependencies: #161466	2025-09-03 06:54:47 +00:00
Scott Wolchok	fa1514acf1	Outline SymInt::maybe_as_int_slow_path (#161466 ) Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161466 Approved by: https://github.com/ezyang	2025-09-03 06:54:47 +00:00
FFFrog	827f0d4054	Using get_paths() to get correct installation path for PYTHONPATH (#161947 ) As the title states. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161947 Approved by: https://github.com/albanD ghstack dependencies: #161845, #161903	2025-09-03 06:38:03 +00:00
Isalia20	2c03f0acc5	[MPS] enable cat op for sparse (#162007 ) Enable cat op for sparse on MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/162007 Approved by: https://github.com/malfet	2025-09-03 06:31:35 +00:00
Scott Wolchok	f8ffa9194e	Perf nitpicks on python_arg_parser's is_int_or_symint_list (#161998 ) This function has come up in DTensor perf work, and I had a nitpick on #160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161998 Approved by: https://github.com/ezyang	2025-09-03 05:38:30 +00:00
fengqing.lu	50fc22dedf	[Intel GPU] Fix XPU SDPA default priority_order UT fail (#161690 ) Fixes #161483 When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` passes because other XPU cases call SDPA, so the priority order is set by `eec876deb6/aten/src/ATen/native/mkldnn/xpu/Attention.cpp (L98-L112)`. However, when `test_default_priority_order` is run on its own, the priority order is unset and the case fails. This PR fixes this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161690 Approved by: https://github.com/guangyey, https://github.com/drisspg	2025-09-03 04:43:27 +00:00
Tianyu Liu	e381d4b020	[DTensor] forbid view ops to redistribute when local split is impossible (#161950 ) This PR is a followup to https://github.com/pytorch/pytorch/pull/149764. In that PR, it only forbids illegal view due to `Flatten`; this PR also forbids illegal view caused by `Split`. This PR also updates the error message to be less about internal implementation details, which users may find confusing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161950 Approved by: https://github.com/ezyang	2025-09-03 04:40:11 +00:00
PyTorch UpdateBot	8875d6e394	[vllm hash update] update the pinned vllm hash (#161929 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161929 Approved by: https://github.com/pytorchbot	2025-09-03 04:26:38 +00:00
Wenyuan Chi	00636e0171	[Reland][Inductor] Prune configs that require more shared memory than the hardware limit. (#161996 ) Summary: This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs. This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` pytest test/inductor/test_max_autotune.py pytest test/inductor/test_triton_heuristics.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996 Approved by: https://github.com/coconutruben	2025-09-03 04:23:09 +00:00
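The pruning idea in #161996 can be sketched as a filter over candidate configs. This is a rough model under stated assumptions, not the Inductor code: the `MMConfig` class, the helper names, and the shared-memory estimate (one A tile plus one B tile per pipeline stage) are assumptions made for illustration. Only the limit of 232448 bytes comes from the error message above. The first hypothetical config is picked so that its estimate equals the 327680 bytes in that message; the commit does not say which real config produced it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMConfig:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_smem_bytes(cfg: MMConfig, itemsize: int = 2) -> int:
    # Assumed model: each pipeline stage holds one A tile and one B tile.
    per_stage = (cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n) * itemsize
    return per_stage * cfg.num_stages


def prune(configs, limit_bytes: int, itemsize: int = 2):
    """Keep only configs whose estimate fits within the shared memory limit."""
    return [c for c in configs if estimated_smem_bytes(c, itemsize) <= limit_bytes]


HARDWARE_LIMIT = 232448  # bytes, as quoted in the error message

configs = [
    MMConfig(128, 128, 128, 5),  # hypothetical: estimate is 327680 bytes
    MMConfig(128, 128, 64, 3),   # hypothetical: estimate is 98304 bytes
]
kept = prune(configs, HARDWARE_LIMIT)
```

Only the second config survives, which mirrors the advice in the error message that reducing block sizes or `num_stages` may help.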
PyTorch UpdateBot	09d2f1b631	[audio hash update] update the pinned audio hash (#161928 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161928 Approved by: https://github.com/pytorchbot	2025-09-03 04:22:55 +00:00
FFFrog	dac8a4b91c	Using pip3 install instead of python setup.py develop/install (#161903 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161903 Approved by: https://github.com/ezyang ghstack dependencies: #161845	2025-09-03 03:12:18 +00:00
FFFrog	d789451ff6	[OpenReg] Migrate Accelerator Document from source/notes into source/accelerator (#161845 ) As the title states. The document keeps growing, so to make it easier for users to read and for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator". Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845 Approved by: https://github.com/albanD	2025-09-03 03:12:18 +00:00
Eli Uriegas	0447f2d99b	build: Add fallback commands to setup.py (#162009 ) Adds fallback commands for the following: * python setup.py install * python setup.py develop Ideally these should just work and should provide backwards compat. Thought process here is that multiple people rely on these commands and just because setuptools wants to drop support for this I don't think a lot of our downstream users who build from source are expecting these to be gone. This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away. Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009 Approved by: https://github.com/clee2000, https://github.com/atalman	2025-09-03 02:56:10 +00:00
William Wen	d5643e8f3a	[dynamo, nested graph breaks] support nested graph breaks that cause skipped frames (#160470 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160470 Approved by: https://github.com/anijain2305 ghstack dependencies: #159329, #159678, #159817, #160138, #159786	2025-09-03 02:47:07 +00:00
Ke Wen	9b81fe281d	[c10d] Lessen density of barrier warning (#162015 ) Warnings are great, but too dense when there are many ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162015 Approved by: https://github.com/d4l3k, https://github.com/H-Huang	2025-09-03 02:20:54 +00:00
Ruben Rodriguez Buchillon	90f50f7e68	[inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339 ) # why - some templates e.g. scale_mm need to unsqueeze/squeeze the nodes for codegen and heuristics - unified place where we can just adjust them for the template # what - inside get_mm_configs, return not the passed in kernel inputs, but allow the template heuristic to adjust them if necessary - the default implementation right now just passes them back this diff just adds the functionality, but does not exercise it other than the default (passthrough) # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	877062c9d3	[inductor][choices][ez] pass through layout and input_nodes (#161338 ) # why - params already available in get_mm_configs - simplifies the code - adds a possibility to edit the nodes/layout in a centralized place # what - add layout and input_nodes into extra_kwargs - no other modifications # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520575](https://our.internmc.facebook.com/intern/diff/D81520575) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161338 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #161123, #161124, #161125, #161126, #161336	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	c31dee6fa5	[inductor][ez] ExternChoice with maybe_append_choice (#161336 ) # why - make the API for ExternChoice the same as KernelTemplate - make it possible to use the same retrieval point as templates # what - add a maybe_append_choice to ExternChoice that under the hood invokes self.bind This pr does not actuate the new path, but just exposes it # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py ``` Differential Revision: [D81520578](https://our.internmc.facebook.com/intern/diff/D81520578) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161336 Approved by: https://github.com/jansel ghstack dependencies: #161123, #161124, #161125, #161126	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	6cb13dd3cc	[inductor] move scaled_mm template args into heuristics (#161126 ) # why - another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call, and handle modifications # what - use the infrastructure for template heuristics to provide extra kwargs that are fixed for a template/op pair to provide the suffix args and epilogue function/fn for scaled_mm # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126 Approved by: https://github.com/jansel ghstack dependencies: #161123, #161124, #161125	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	cbf01c11ff	[inductor] move addmm/baddbmm template args into heuristics (#161125 ) # why - another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call, and handle modifications # what - use the infrastructure for template heuristics to provide extra kwargs that are fixed for a template/op pair to provide the prefix args and epilogue function/fn for addmm/baddbmm - expand kernelinputs to also be able to shuttle around non tensor inputs (scalars) as is needed for alpha and beta # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm ``` Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125 Approved by: https://github.com/jansel ghstack dependencies: #161123, #161124	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	7cdfa520a6	[inductor] move tma workspace in heuristics (#161124 ) # why - another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call, and handle modifications # what use the infrastructure for template heuristics to provide extra kwargs that are fixed for a template/op pair to provide the workspace_arg for all the tma templates # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma ``` Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124 Approved by: https://github.com/jansel ghstack dependencies: #161123	2025-09-03 01:03:57 +00:00
Ruben Rodriguez Buchillon	1485ac3264	[inductor] add notion of extra_kwargs for mm_configs (#161123 ) # why - some kwargs are choice independent but rather always the same for a specific op or template - this enables us to track those differently than the choice ones, and thus enables interception of them cleaner - maybe_append_choices can then be simplified to just pass through the kwargs # what - hookup for template heuristics to have per template/op extra kwargs that are always the same, for all choices - hookup for the called to get_mm_configs to provide template/op kwargs to override some of the template/choice kwargs this pr does not use the new machinery, and everything is empty for now. subsequent prs start using it to simplify ops # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123 Approved by: https://github.com/jansel	2025-09-03 01:03:57 +00:00
Alex Malyshev	c5b8a10be5	Fix compiler errors in 3.14 stub definitions (#161792 ) The functions here expect to return pointers, but currently aren't returning anything. Make them return NULL. The properties array wants an extra set of braces. One pair for the array, another for the first item in the array. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792 Approved by: https://github.com/Skylion007	2025-09-03 00:58:41 +00:00
Ke Wen	a02ee4a816	[SymmMem] Use non-blocking version of getmem (#162006 ) As titled: the `getmem` calls in the loop are now non-blocking, so that we max out the issuance rate. Also added a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006 Approved by: https://github.com/ngimel	2025-09-02 23:55:22 +00:00
xinan.lin	81b7b16618	Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" (#161949 ) This PR relands #161142, which was reverted so that another PR could be reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949 Approved by: https://github.com/jansel	2025-09-02 23:43:27 +00:00
PyTorch MergeBot	4cdaf8265d	Revert "Update Kineto submodule (#161572 )" This reverts commit d33840c542b387ab08ba49aa6c45aa9567fd9be7. Reverted https://github.com/pytorch/pytorch/pull/161572 on behalf of https://github.com/seemethere due to This appears as though its causing downstream build failures in inductor workflows and for developers working locally. Going to revert out of an abundance of caution. ([comment](https://github.com/pytorch/pytorch/pull/161572#issuecomment-3247121981))	2025-09-02 23:28:19 +00:00
Kevin Fu	874069fbe4	Log Const Folded Node (#161827 ) Summary: Log folded nodes for easier debugging. Test Plan: sandcastle. Rollback Plan: Reviewed By: henryoier Differential Revision: D81352098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161827 Approved by: https://github.com/henryoier, https://github.com/yewentao256	2025-09-02 23:23:51 +00:00
Ke Wen	ab643e4dbb	[SymmMem] Increase minimum nthreads to cover sync needs in NVL72 (#161983 ) `sync_remote_blocks` maps threads to peers. Previously the minimum nthreads was the warp size, which is too small to cover NVL72. Bumping it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161983 Approved by: https://github.com/ngimel	2025-09-02 23:18:08 +00:00
Ke Wen	5a2da090ed	[SymmMem] Make sure CUDA runtime is initialized before NVSHMEM init (#161232 ) Previously, without calling `torch.empty` before NVSHMEM init, we see error below: ``` src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed ``` Fixing it by calling a `cudaFree(nullptr)` to make sure CUDA runtime is initialized before NVSHMEM init. Removing all `torch.empty(1)` calls from tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232 Approved by: https://github.com/ngimel ghstack dependencies: #161214	2025-09-02 22:53:28 +00:00
Justin Chu	bd39e47fee	[ONNX] Default to dynamo export (#159646 ) Set dynamo=True and enable fallback. 1. Implemented the compatible behavior where BytesIO objects as `f` is accepted 2. Update tests to explicitly set dynamo=False #151693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646 Approved by: https://github.com/titaiwangms	2025-09-02 22:45:55 +00:00
zhxchen17	e4bd0ff4f8	[aot precompile] Handle closure variables. (#161990 ) We previously assumed aot precompile should only work on non-closures. This is hard to enforce in practice because we will see a lot of cases with decorators (e.g. Hugging Face models) ``` def check_inputs(fn): def _fn(*args, **kwargs): for arg in args: assert arg.shape[0] > 1 return fn(*args, **kwargs) return _fn @check_inputs def foo(x, y): a = x + x b = y + y c = a + b return c ``` It doesn't make sense to not support these cases since they are straightforward to handle. This PR adds the logic to handle closures and make sure they can be precompiled properly. Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990 Approved by: https://github.com/angelayi	2025-09-02 22:26:04 +00:00
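A decorated function like the one in #161990 is a closure in the Python sense, which can be confirmed by inspecting it. A minimal sketch in plain Python: it uses lists instead of tensors (so `len(arg)` stands in for `arg.shape[0]`), and it only shows where the captured variable lives. It does not reproduce how the PR serializes or precompiles closures.

```python
def check_inputs(fn):
    def _fn(*args, **kwargs):
        for arg in args:
            assert len(arg) > 1  # stand-in for `arg.shape[0] > 1`
        return fn(*args, **kwargs)
    return _fn


@check_inputs
def foo(x, y):
    # Element-wise (x + x) + (y + y), written for plain lists.
    return [a + a + b + b for a, b in zip(x, y)]


# After decoration, `foo` is the inner `_fn`, which captures `fn`.
free_vars = foo.__code__.co_freevars
captured = [cell.cell_contents for cell in foo.__closure__]
result = foo([1, 2], [3, 4])
```

`free_vars` names the captured variable (`fn`) and `captured` holds the original undecorated `foo`, which is the state a precompile path has to account for.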
PyTorch MergeBot	15c77a8cfd	Revert "Add inductor provenance mapping for cpp extern kernel (#161656 )" This reverts commit 5e5870e858f60ff4bf87d03f3592097e934a9580. Reverted https://github.com/pytorch/pytorch/pull/161656 on behalf of https://github.com/jeffdaily due to causing failures on ROCm MI300, will add label to PR ([comment](https://github.com/pytorch/pytorch/pull/161656#issuecomment-3246965676))	2025-09-02 22:19:19 +00:00
Kurt Mohler	791eff96c8	[MPS] Add `igamma/igammac` ops (#161927 ) Fixes #161725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161927 Approved by: https://github.com/malfet	2025-09-02 20:52:02 +00:00
Chris Leonard	80dd397f19	Argsort doc stable kwargs (#161986 ) Fixes #129311 Updated torch.argsort documentation to reflect that the 'stable' parameter is a keyword argument and not a normal parameter. @albanD, @soulitzer Pull Request resolved: https://github.com/pytorch/pytorch/pull/161986 Approved by: https://github.com/soulitzer	2025-09-02 20:42:53 +00:00
orangeH25	a75e8cd270	Add api info for torch._C._nn.pyi (#161958 ) Fix part of #148404 The APIs involved are as follows: - max_pool2d_with_indices - max_pool3d_with_indices - elu - glu - max_unpool2d - max_unpool3d Pull Request resolved: https://github.com/pytorch/pytorch/pull/161958 Approved by: https://github.com/ezyang	2025-09-02 20:39:20 +00:00
PyTorch MergeBot	4e42aa8ffc	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit b7034e9c924412bfbe8ee25a22d7e95239b5ca65. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))	2025-09-02 20:28:42 +00:00
PyTorch MergeBot	420c52ecf3	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))	2025-09-02 20:24:01 +00:00
PyTorch MergeBot	82f63c8f6d	Revert "[HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946 )" This reverts commit 5561e45758d59c94605873d5db48ed459c004c3b. Reverted https://github.com/pytorch/pytorch/pull/161946 on behalf of https://github.com/jeanschmidt due to Need to be reverted so https://github.com/pytorch/pytorch/pull/159889 can be ([comment](https://github.com/pytorch/pytorch/pull/161946#issuecomment-3246663376))	2025-09-02 20:18:52 +00:00
Xu Han	b4ad38279b	[AOTI] Add Windows-compatible implementation of the mmap-related funcs (#161805 ) Add Windows-compatible implementation of the mmap-related functions. This code was validated on a small development project: https://github.com/xuhancn/cross_os_mmap?tab=readme-ov-file#cross_os_mmap Pull Request resolved: https://github.com/pytorch/pytorch/pull/161805 Approved by: https://github.com/angelayi	2025-09-02 20:07:41 +00:00
Wei Wang	ef8aabd424	[CD][CUDA13][ARM] aarch64 binary seems to be missing Triton dependency (#161833 ) Requires: filelock, fsspec, jinja2, networkx, setuptools, sympy, typing-extensions Seems to be missing Triton. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161833 Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman	2025-09-02 19:31:14 +00:00
Isalia20	dcf385395d	[MPS] Move sparsemps testing from test_mps to test_sparse (#161852 ) Moves Sparse MPS testing from test_mps to test_sparse. Lots of skips now but I expect to remove them iteratively once ops are implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/161852 Approved by: https://github.com/malfet	2025-09-02 19:04:11 +00:00
Animesh Jain	600c25e9a1	[dynamo] Graph break on torch.cuda.synchronize (#161925 ) Today, AOTDispatcher ignores cuda.synchronize. Even if we wrap it in some HOP, we need it to be a barrier op to prevent any inductor reordering. So graph breaking. Fixes https://github.com/pytorch/pytorch/issues/160751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161925 Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/mlazos	2025-09-02 19:00:21 +00:00
Ke Wen	f981a7fa52	[SymmMem] Add device guard before alloc (#161214 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161214 Approved by: https://github.com/ngimel	2025-09-02 18:53:45 +00:00
sibuachu	b7e207ca9f	Make error message descriptive (#150627 ) (#159423 ) Summary: Adding the number of local shards to error messages makes it easier to debug. Test Plan: UT Differential Revision: D72396478 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159423 Approved by: https://github.com/Saiteja64	2025-09-02 17:54:39 +00:00
Shangdi Yu	5e5870e858	Add inductor provenance mapping for cpp extern kernel (#161656 ) Summary: Add inductor provenance mapping for cpp extern kernel Test Plan: ``` buck run fbcode//caffe2/test/inductor:provenance_tracing -- -r test_cpu_extern_kernel ``` Differential Revision: D81161751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161656 Approved by: https://github.com/angelayi	2025-09-02 17:54:04 +00:00
Yu, Guangye	a99d8d39bc	Update torch-xpu-ops commit pin (#161919 ) # Motivation 1. Fallback some linalg functionality such as `linalg_eig`, `linalg_householder_product`, `linalg_solve_triangular` to CPU; 2. Fix codegen dependency bug. # Additional Context This PR aims to fix https://github.com/pytorch/pytorch/issues/161498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161919 Approved by: https://github.com/EikanWang	2025-09-02 17:09:07 +00:00
PyTorch MergeBot	d6b74568e2	Revert "Add __init__.pyi to torch/linalg (#160750 )" This reverts commit 9a665ca3c472384e9d722bddba79e5a7680f1abd. Reverted https://github.com/pytorch/pytorch/pull/160750 on behalf of https://github.com/jeanschmidt due to Seems that those errors are legitimate, and there is no test plan. I'll be proceeding with a revert ([comment](https://github.com/pytorch/pytorch/pull/160750#issuecomment-3246095383))	2025-09-02 16:53:55 +00:00
Shivam Raikundalia	d33840c542	Update Kineto submodule (#161572 ) Differential Revision: D81087601 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161572 Approved by: https://github.com/cyyever, https://github.com/aaronenyeshi	2025-09-02 16:31:55 +00:00
Justin Chu	f0c391102b	[ONNX] Remove private members from torch.onnx (#161546 ) Remove import of two functions - _run_symbolic_function - _run_symbolic_method to the `torch.onnx` namespace. Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161546 Approved by: https://github.com/titaiwangms ghstack dependencies: #161323, #161449	2025-09-02 16:31:23 +00:00
Jagadish Krishnamoorthy	a8d6943d36	ROCm: Enable overload tests from test_matmul_cuda (#161540 ) This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload. Tests were disabled as part of #150812 Rocblas backend tests are not enabled yet, WIP. Test command PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540 Approved by: https://github.com/jeffdaily	2025-09-02 16:27:42 +00:00
Justin Chu	d11720efdb	[ONNX] Remove unused logic from internal verification module (#161449 ) Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161449 Approved by: https://github.com/xadupre, https://github.com/titaiwangms ghstack dependencies: #161323	2025-09-02 16:22:49 +00:00
Edward Yang	9a1c5c0a07	Detect torch function in lists as well (#160256 ) We basically follow the same pattern we do for tensor arguments. The major downside is we now have to traverse the entirety of the int list / etc where previously we didn't have to. Benchmark suggests 2% regression for relevant things. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160256 Approved by: https://github.com/albanD	2025-09-02 16:22:42 +00:00
Justin Chu	524b78d4f6	[ONNX] Refactor torchscript based exporter (#161323 ) Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved. - Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly - Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private&unused @albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and `torch/autograd/_functions/utils.py`? Thanks! ## BC Breaking - Deprecated members in `torch.onnx.verification` are removed Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323 Approved by: https://github.com/titaiwangms, https://github.com/angelayi	2025-09-02 16:10:30 +00:00
Wang, Chuanqi	793fc12aff	[CD] Fix setup-xpu action issue (#161934 ) Fix XPU CD test failure, refer https://github.com/pytorch/pytorch/actions/runs/17370923627/job/49315624191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161934 Approved by: https://github.com/atalman	2025-09-02 16:03:44 +00:00
Boyuan Feng	204697f0e6	[CUDAGraph] add config to error on skipping cudagraph (#161862 ) Many users want a config to force all cuda ops captured by cudagraph. When not possible, pt2 should error. This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default as False). Also added an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862 Approved by: https://github.com/ezyang	2025-09-02 15:28:22 +00:00
Guilherme Leobas	789d494212	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-02 15:27:37 +00:00
DrStone71	bc4db2c27f	CUDA 13 -- sm_120 -- Nvidia 5090 -- ptxas warning : Value of threads … (#161380 ) Bug fix: I have opened an issue ( https://github.com/pytorch/pytorch/issues/161376 ) and I suggest this bug fix. With this method it compiles fine. Fixes #161376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161380 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>	2025-09-02 13:27:57 +00:00
PyTorch MergeBot	e304ea4e69	Revert "[BE] Update xpu driver repo for CD used almalinux 8.10 (#157356 )" This reverts commit c78bbdf4102d2c13bf6aa1abe4352aa7bca401ca. Reverted https://github.com/pytorch/pytorch/pull/157356 on behalf of https://github.com/chuanqi129 due to This PR has performance regression on some workloads ([comment](https://github.com/pytorch/pytorch/pull/157356#issuecomment-3245319046))	2025-09-02 13:20:38 +00:00
Jean Schmidt	1f820de639	[ci] Increase shards for linux-jammy-py3.10-clang18-asan on pull.yml to 7 (#161968 ) [ci] Increase shards for linux-jammy-py3.10-clang18-asan to 7	2025-09-02 14:08:47 +02:00
Rohit Singh Rathaur	fca2601c9d	Improve error message for unsupported padding config (#160866 ) Fixes #160053 The previous error message `Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now` was not clear. Now we have ``` python3 Python 3.13.5 \| packaged by conda-forge \| (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch ... import torch.nn.functional as F ... a = torch.empty(2,2,2,2) ... F.pad(a, (1,1), mode="circular") ... Traceback (most recent call last): File "<python-input-0>", line 4, in <module> F.pad(a, (1,1), mode="circular") ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad return torch._C._nn.pad(input, pad, mode, value) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ NotImplementedError: Padding size 2 is not supported for 4D input tensor. Supported combinations for non-constant padding: - 2D or 3D input: padding size = 2 (pads last dimension) - 3D or 4D input: padding size = 4 (pads last 2 dimensions) - 4D or 5D input: padding size = 6 (pads last 3 dimensions) >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866 Approved by: https://github.com/mikaylagawarecki	2025-09-02 07:15:59 +00:00
Yu, Guangye	f8746b878d	Add uuid to XPU device properties (#161392 ) # Motivation Fix https://github.com/intel/torch-xpu-ops/issues/1955 Refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-uuid, `ext::intel::info::device::uuid` returns `std::array<unsigned char, 16>` as the UUID. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161392 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-09-02 06:41:32 +00:00
Tianyu Liu	8703debf66	[DTensor] select strategy with no redistribute when redistribute cost is 0 (#161882 ) Before this PR, the `_select_strategy` always selects the first strategy with minimum redistribute cost. This causes unexpected behavior when - multiple strategies have 0 redistribute costs - the first one with 0 redistribute cost may perform local chunking E.g. in memory efficient SDPA, the default orders of candidate strategies have a `Shard(2)` one before the `Replicate()` one. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_matrix_ops.py#L500-L512 When the input is `Replicate()`, `_select_strategy` will pick the `Shard(2)` strategy and do local chunking first, before local computation. This is clearly unexpected to users. In this PR, we improve `_select_strategy` so that when multiple strategies have 0 redistribute cost, we prioritize the one which keeps input unchanged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161882 Approved by: https://github.com/ezyang	2025-09-02 05:41:56 +00:00
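The tie-breaking change described in #161882 can be illustrated with a minimal sketch. This is not the DTensor implementation: `Candidate`, `keeps_input_unchanged`, and both selector functions are hypothetical names introduced here, and the costs are made-up values chosen only to reproduce the zero-cost tie from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical stand-in for a sharding strategy; not a DTensor class.
    name: str
    redistribute_cost: float
    keeps_input_unchanged: bool


def select_first_min(candidates):
    """Behavior described as 'before': the first candidate with minimum cost wins."""
    return min(candidates, key=lambda c: c.redistribute_cost)


def select_prefer_unchanged(candidates):
    """Behavior described as 'after': on a zero-cost tie, prefer the candidate
    that leaves the input placement unchanged."""
    best = min(c.redistribute_cost for c in candidates)
    tied = [c for c in candidates if c.redistribute_cost == best]
    if best == 0:
        for c in tied:
            if c.keeps_input_unchanged:
                return c
    return tied[0]


# Assumed ordering from the commit message: the Shard(2) candidate is listed
# before the Replicate() one, and both have zero redistribute cost.
candidates = [
    Candidate("Shard(2)", 0.0, keeps_input_unchanged=False),
    Candidate("Replicate()", 0.0, keeps_input_unchanged=True),
]
old_choice = select_first_min(candidates)
new_choice = select_prefer_unchanged(candidates)
```

With a replicated input, the first selector picks the `Shard(2)` candidate (implying local chunking), while the second picks the `Replicate()` candidate.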
bobrenjc93	1aeb421c34	Make pattern matcher resilient to ddes (#161843 ) Motivated by the following discord support chat: https://discord.com/channels/1189498204333543425/1409578286186758195 ``` import torch @torch.compile(fullgraph=True, mode='reduce-overhead') def get_mask(W: torch.Tensor, percentage_nonzeros: torch.Tensor): total_elements = W.numel() k = int(total_elements * percentage_nonzeros) top_k_indices = torch.topk(torch.abs(W).flatten(), k)[1] mask = torch.zeros(total_elements, dtype=torch.bool, device=W.device) mask.scatter_(0, top_k_indices, True) mask = mask.view(W.shape) return mask x = torch.randn((128, 64), device='cuda') p = torch.tensor(0.50, device='cuda') get_mask(x, p) ``` Results in ``` InductorError: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(TruncToInt(zuf0), 1) (unhinted: Eq(TruncToInt(zuf0), 1)). (Size-like symbols: none) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161843 Approved by: https://github.com/ezyang	2025-09-02 05:16:13 +00:00
Edward Yang	5561e45758	[HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161946 Approved by: https://github.com/msaroufim	2025-09-02 05:01:46 +00:00
soulitzer	8171d6052e	Clear custom autograd Function ctx.to_save earlier (#161171 ) Fixes https://github.com/pytorch/pytorch/issues/161186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161171 Approved by: https://github.com/albanD	2025-09-02 03:26:31 +00:00
Dev Sashidhar	d5e0f4202b	Fixes broken memory_viz link in CUDA memory docs (#161426 ) Fixes #161375 The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely the same syntax worked earlier on the page as the issuer mentioned, but in this spot it's rendered as a broken link. I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426 Approved by: https://github.com/soulitzer	2025-09-02 02:06:54 +00:00
Xuehai Pan	13d66e2a66	[BE][Easy] restore #157584 after #158288 (#158541 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158541 Approved by: https://github.com/ezyang	2025-09-02 02:06:50 +00:00
Edward Yang	bbedc71fd3	test: ensure editable cached wrapper is respected (#160943 ) ## Summary - add a test verifying that editing the local cache wrapper is picked up after Dynamo reset ## Testing - `lintrunner -a` (fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure) - `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v` ------ https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160943 Approved by: https://github.com/xmfan	2025-09-02 01:48:30 +00:00
Animesh Jain	e9481b6617	[dynamo] Prevent unnecessary recompile on disabled functions in the compiled frame (#161883 ) Trying out a re-impl of https://github.com/pytorch/pytorch/pull/160934 The above PR led to OOM, most likely because of the cache holding to a nested function (which if not held in the cache would have been garbage collected), which holds on to cuda tensors in its closure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161883 Approved by: https://github.com/jansel	2025-09-02 01:13:48 +00:00
gaoyufeng	1c1b28d5b6	Fix slice scatter dtype consistency (#160851 ) Fixes #147842 Fix torch.slice_scatter type inconsistency issue. I noticed previous PRs on this have stalled, so I'm opening this new PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160851 Approved by: https://github.com/soulitzer	2025-09-02 01:08:26 +00:00
Xu Han	2a5c0785e2	[AOTI] split too long string to smaller pieces when its length larger than 16000, fix msvc c2026. (#161850 ) Split too long string to smaller pieces when its length larger than 16000, fix msvc c2026. reproducer: ```cmd pytest test\inductor\test_aot_inductor.py -v -k test_runtime_checks_large_cpu ``` Error message: <img width="1660" height="174" alt="image" src="https://github.com/user-attachments/assets/56fcd9be-24cb-484b-bfdc-f719ff2650b8" /> For MSVC c2026: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2026?view=msvc-170 We can split too long string to smaller pieces, it can fix this issue. Local validated: <img width="1122" height="232" alt="image" src="https://github.com/user-attachments/assets/cac54cc9-be51-4a5d-b408-06755a4debd5" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161850 Approved by: https://github.com/jansel	2025-09-02 00:09:01 +00:00
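The splitting idea in #161850 can be sketched as follows. This is only an illustration of chunking a long string at the 16000-character threshold quoted in the commit message; `split_literal` is a hypothetical helper, not the function used in the PR, and it assumes the text contains no escape sequences that a cut could split in two.

```python
MAX_PIECE_LEN = 16000  # threshold quoted in the commit message


def split_literal(text, limit=MAX_PIECE_LEN):
    """Split text into consecutive pieces of at most `limit` characters.

    Hypothetical helper: assumes `text` has no escape sequences that a cut
    could break apart.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    pieces = [text[i:i + limit] for i in range(0, len(text), limit)]
    return pieces or [""]


long_text = "a" * 40000
pieces = split_literal(long_text)
```

Joining the pieces back together yields the original string, so a code generator could emit them as adjacent literals without changing the resulting value.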
Edward Z. Yang	626cb7df81	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-01 23:00:21 +00:00
Edward Yang	b7034e9c92	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-01 23:00:21 +00:00
PyTorch MergeBot	13b65196db	Revert "Defer loading hipify until it is needed (#160824 )" This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d. Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](`403a3a393c`) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))	2025-09-01 21:34:13 +00:00
Guilherme Leobas	403a3a393c	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-01 20:57:41 +00:00
Ivan Komarov	cbfb005f7c	Fix type checking for persistent loads in the weights-only unpickler (#161661 ) The error message here implies that we can only call `self.persistent_load(...)` for ints or tuples, but due to the second part of the type check being inverted, weights-only unpickler will throw an exception iff `pid` is an int. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161661 Approved by: https://github.com/Skylion007	2025-09-01 19:57:19 +00:00
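The inverted check described in #161661 can be reconstructed from the commit message alone. The sketch below is an assumption about the shape of the condition, not a copy of the PyTorch source; `buggy_rejects` and `fixed_rejects` are hypothetical names that return `True` where the unpickler would raise.

```python
def buggy_rejects(pid):
    # The second clause is inverted: `not (type(pid) is not int)` means
    # "pid is an int", so the whole check rejects exactly the ints.
    return type(pid) is not tuple and not (type(pid) is not int)


def fixed_rejects(pid):
    # Intended check: reject anything that is neither a tuple nor an int.
    return type(pid) is not tuple and type(pid) is not int
```

Under this reading, the buggy version rejects `1` but accepts `"x"`, while the fixed version accepts ints and tuples and rejects everything else, matching the error message's intent.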
Huy Do	d232a95d4a	[BE] Consolidate inductor benchmark Docker images and rename jobs (#161536 ) We have 4 different versions of inductor benchmark Docker images used in CI at the moment: 1. `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks` is used by almost all inductor jobs including nightly benchmark 2. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.12 3. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.13 4. `pytorch-linux-jammy-py3-gcc11-inductor-benchmarks` runs inductor unit tests on CPU My proposal here is to clean up (2) and (3) and to keep (1) under the same setup from https://ghcr.io/pytorch/torchbench. Simplicity is the key here as inductor workflows are getting more and more complex: 1. Unit tests for Python variants like 3.12 and 3.13 were useful when they were first added to CI. They are much less useful now. [Flambeau](https://hud.pytorch.org/flambeau/s/3876ec7b-43f0-42c6-bfbf-899035e5bb77) shows a 0.97 correlation between them. And we are also moving to 3.14 nowadays. I want to choose 3.12 for (1), but will do this separately. This is also what TorchBench and vLLM are using on CI. 1. We are gradually cleaning up 3.9 on CI https://github.com/pytorch/pytorch/issues/161167 Another BE change here is to rename the jobs in various inductor workflows because I think names like `linux-jammy-cuda12_8-py3_10-gcc9-inductor-build` are too long and confusing to look at; better to just use human-friendly names like `inductor-build`. Other information is already spelled out in the build environment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161536 Approved by: https://github.com/zou3519	2025-09-01 19:07:08 +00:00
PyTorch MergeBot	17fa8eec4a	Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387 )" This reverts commit 4b4cdcfe3af10df624878985caac4e595fbab54c. Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/atalman due to need to revert due to merge conflicts, please feel free to merge it back in once conflicts are resolved ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3242945661))	2025-09-01 17:08:27 +00:00
PyTorch MergeBot	54e275e0d8	Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" This reverts commit c83cbd2f2a2de2e3258f07de77d8740743df6d2d. Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))	2025-09-01 17:03:50 +00:00
PyTorch MergeBot	63a9c23fe9	Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 )" This reverts commit 190c391a28845a14df26abb228d26aa813efb20c. Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))	2025-09-01 16:27:03 +00:00
Ting Lu	fefee08164	[CD] Add CUDA 13.0 Windows build (#161663 ) Test CUDA 13.0 windows build Pull Request resolved: https://github.com/pytorch/pytorch/pull/161663 Approved by: https://github.com/malfet, https://github.com/atalman	2025-09-01 15:27:17 +00:00
PyTorch MergeBot	21fae99c18	Revert "[cuBLASLt][FP8] `cuBLASLt` appears to support float8 rowwise-scaling on H100 (#161305 )" This reverts commit 55c289d5c104c4959cc125c0fb4fb50c9fc71102. Reverted https://github.com/pytorch/pytorch/pull/161305 on behalf of https://github.com/atalman due to Broke test_matmul_cuda.py::TestFP8MatmulCUDA::test_float8_error_messages_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17309011599/job/49140215634) [HUD commit link](`1190b7f73e`) ([comment](https://github.com/pytorch/pytorch/pull/161305#issuecomment-3242652672))	2025-09-01 14:56:47 +00:00
PyTorch UpdateBot	2ba65472dd	[xla hash update] update the pinned xla hash (#161396 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161396 Approved by: https://github.com/pytorchbot	2025-09-01 11:43:03 +00:00
Frank Lin	190c391a28	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. 
Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. 
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. 
We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join (see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. As a result, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-01 09:25:01 +00:00
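Note on #158352: the per-stream rule above amounts to a reachability check on the capture graph. The following is a minimal Python sketch under stated assumptions, not the allocator's actual C++ code: the graph is a plain dict of successor lists, the node names (`k1`, `free1`, `join1`, ...) and the helpers `is_predecessor` and `reusable_on_stream` are hypothetical, and a free marker that is itself the terminal is treated as satisfying the rule because new work is appended after it.

```python
def is_predecessor(successors, src, dst):
    """Return True if dst can be reached from src along dependency edges.

    Assumption for this sketch: src == dst counts as satisfied, since a free
    marker that is the stream's tail is ordered before anything appended later.
    """
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(successors.get(node, ()))
    return False


def reusable_on_stream(successors, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal of stream S."""
    return all(
        is_predecessor(successors, marker, terminal)
        for marker in free_markers
        for terminal in terminals
    )


markers = ["free1", "free2"]

# Case 0: the two frees are unordered; stream 1's terminal is its own marker.
unordered = {"k1": ["free1"], "k2": ["free2"]}
case0 = reusable_on_stream(unordered, markers, ["free1"])

# Case 1: stream 1 joins stream 2, so its terminal is ordered after both frees.
joined = {"k1": ["free1"], "k2": ["free2"], "free1": ["join1"], "free2": ["join1"]}
case1 = reusable_on_stream(joined, markers, ["join1"])

print(case0, case1)  # False True
```

In the real allocator the terminal set comes from `cudaStreamGetCaptureInfo`; the sketch only illustrates why Case 0 is rejected and Case 1 is accepted.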
Raman-RH	20bfb2539d	Skip compilation when FX graph has no calls and returns empty (#160536 ) Fixes #160437 Summary: This PR avoids compiling empty FX graphs generated during graph breaks. If there are no calls in the graph, we can just return the empty list of instructions. More precisely, In compile_and_call_fx_graph, if the FX graph contains no calls (count_calls(self.graph) == 0) and the return value list is empty, we now return an empty instruction list immediately Impact: module: dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/160536 Approved by: https://github.com/Lucaskabela	2025-09-01 08:32:22 +00:00
Eli Uriegas	dd2519abe8	ci: Update sphinx, disable google search by default (#161793 ) Includes fixes from https://github.com/pytorch/pytorch_sphinx_theme/pull/207 Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161793 Approved by: https://github.com/malfet, https://github.com/albanD	2025-09-01 07:43:39 +00:00
Ke Wen	2f6b4b1ad3	[4/N][SymmMem] Add `get_remote_tensor` + move up `get_buffer` and `get_signal_pad` (#161533 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): `get_remote_tensor`: returns a symmetric tensor given a peer rank. The difference between the `get_buffer` API and the `get_remote_tensor` API: - the former accepts an offset, whereas the latter doesn't - the latter returns a symmetric tensor at `hdl.offset` on `peer`. As a refactoring, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level, as their code is common to all backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533 Approved by: https://github.com/ngimel ghstack dependencies: #161470, #161471, #161532	2025-09-01 07:02:06 +00:00
Zheng, Zhaoqiong	6737e2c996	update supported OS for Intel client GPU (#161699 ) update supported OS for Intel client GPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/161699 Approved by: https://github.com/chuanqi129, https://github.com/malfet	2025-09-01 05:45:09 +00:00
PyTorch UpdateBot	67c31dcd36	[vllm hash update] update the pinned vllm hash (#161867 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161867 Approved by: https://github.com/pytorchbot	2025-09-01 04:37:13 +00:00
Yu, Guangye	cb1e31362c	Remove background thread UT on XPU to fix CI (#161844 ) # Motivation Because we revert `torch._C._set_allocator_settings` in https://github.com/pytorch/pytorch/pull/161626, this UT becomes invalid. Fix https://github.com/pytorch/pytorch/issues/161697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161844 Approved by: https://github.com/gujinghui	2025-09-01 03:45:26 +00:00
Sean McGovern	9a665ca3c4	Add __init__.pyi to torch/linalg (#160750 ) Fixes #149639 In an effort to improve the type checking coverage, added a stub file for the torch/linalg directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160750 Approved by: https://github.com/Skylion007	2025-08-31 22:39:05 +00:00
Edward Yang	d9d6dde0f4	Leak Python filenames so that we can give good dispatcher errors. (#160418 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160418 Approved by: https://github.com/zou3519	2025-08-31 22:31:39 +00:00
Scott Wolchok	68738beff7	PythonArgs::toBool: order cheap mutually exclusive checks first (#161455 ) SymBools are not identical to Py_True or Py_False, so we can do those cheap checks first and at least get plain old bools to go fast. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161455 Approved by: https://github.com/Skylion007 ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329, #161432	2025-08-31 21:35:48 +00:00
Ke Wen	25f4aaed9e	[3/N][SymmMem] Expose offset field from handle (#161532 ) As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532 Approved by: https://github.com/ngimel ghstack dependencies: #161470, #161471	2025-08-31 18:08:57 +00:00
Ke Wen	61e18b5304	[2/N][SymmMem] Add MemPool allocator and tests (#161471 ) (Porting most of #161008) Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory. To end users, this PR supports a python UI as follows: ``` allocator = symm_mem.get_mempool_allocator(device) mempool = torch.cuda.MemPool(allocator) with torch.cuda.use_mem_pool(mempool): tensor = torch.arange(numel, dtype=dtype, device=device) ``` Added tests for both use cases above. Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471 Approved by: https://github.com/ngimel ghstack dependencies: #161470	2025-08-31 18:08:57 +00:00
Rohit Manav	e92cd94153	removed duplicate imports (#161685 ) Fixes #161684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161685 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-08-31 16:21:49 +00:00
Raman Kumar	0d421ace32	fix spelling of word - when (#160185 ) just found a typo while understanding the codebase while working on another PR This fixes typo in word `when` in files ``` native/cpu/PaddingKernel.cpp native/cpu/batch_norm_kernel.cpp ``` @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/160185 Approved by: https://github.com/yewentao256, https://github.com/ezyang	2025-08-31 13:38:23 +00:00
Tan Hoang	91f0bcf43f	[c10d][nvshmem] add nvshmem build rules and dependency for libtorch_cuda (#159562 ) Summary: Add guarded build option for nvshmem-related c10d code with `-c fbcode.caffe2_use_nvshmem` Guarded clause include nvshmem device + host code (static-linked) + these 2 files: - `torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu` - `torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159562 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2025-08-31 12:56:51 +00:00
Xia, Weiwen	75bc23cfc3	[CPU][Inductor] Improve performance of A16W8 GEMM template (#161148 ) Summary This PR improves the performance of A16W8 GEMM template by - Removing the config with block_n=48 & block_m=16 as it is not very efficient. - Using AMX microkernel when M >= 5 so that we use AMX instead of AVX512 for M=5~31. - Converting int8 values to bf16 with intrinsics instead of `at::vec::convert` as the latter does not have optimized implementation for this case. We saw up to >10% performance gain in various cases of running Llama-3.1-8b-instruct. Test plan Already covered by UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161148 Approved by: https://github.com/CaoE, https://github.com/jansel	2025-08-31 09:56:29 +00:00
Natalia Gimelshein	377033757a	Use vectorized stores for all dtypes in cat (#161649 ) resurrecting #151818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649 Approved by: https://github.com/Skylion007	2025-08-31 05:42:41 +00:00
PyTorch UpdateBot	f612045ce1	[vllm hash update] update the pinned vllm hash (#161835 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161835 Approved by: https://github.com/pytorchbot	2025-08-31 04:24:04 +00:00
Xu Han	ad7b748686	[AOTI] fix ut, add extension file type for Windows. (#161851 ) fix ut, add extension file type for Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161851 Approved by: https://github.com/ezyang	2025-08-31 01:13:29 +00:00
Isalia20	f3697b033e	[MPS] add bunch of unary funcs for sparse tensors (#161846 ) adds bunch of unary functions for sparse tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/161846 Approved by: https://github.com/malfet	2025-08-30 21:13:05 +00:00
Lakshay Garg	2d31c3d99d	Pass shared_ptr by value (#161834 ) The way AsyncAllreduceCUDADeviceWork is currently implemented, using it will force a copy of `shared_ptr<gloo::Context>` because `std::move` does nothing for a const ref. This PR changes the param type to shared_ptr<> instead of the const ref. This allows more efficient parameter passing. Here's an example that demonstrates the issue: ```cpp #include <memory> #include <iostream> struct Foo {}; void useFoo_ref(const std::shared_ptr<Foo>& f) { std::shared_ptr<Foo> internal = std::move(f); std::cout << "use_count: " << internal.use_count() << '\n'; } void useFoo_val(std::shared_ptr<Foo> f) { std::shared_ptr<Foo> internal = std::move(f); std::cout << "use_count: " << internal.use_count() << '\n'; } int main() { std::shared_ptr<Foo> f1 = std::make_shared<Foo>(); useFoo_ref(std::move(f1)); // prints "use_count: 2" std::shared_ptr<Foo> f2 = std::make_shared<Foo>(); useFoo_val(std::move(f2)); // prints "use_count: 1" } ``` This also aligns well with [C++ Core Guidelines][1] for handling smart pointers. [1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501	2025-08-30 18:00:37 +00:00
PyTorch MergeBot	fb2d5ea697	Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471 )" This reverts commit b291dc9684d00396239a0c7786b7aac71bf69c05. Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3239283585))	2025-08-30 14:00:29 +00:00
PyTorch MergeBot	2e1345a0f8	Revert "[3/N][SymmMem] Expose offset field from handle (#161532 )" This reverts commit ff9533970ad76ed1905b90df6515aca50354c193. Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3239282308))	2025-08-30 13:57:50 +00:00
PyTorch MergeBot	684ae48c16	Revert "[4/N][SymmMem] Add `get_remote_tensor` + move up `get_buffer` and `get_signal_pad` (#161533 )" This reverts commit 95516ad7e6d92ed131fb6057b29ec52e73190e3c. Reverted https://github.com/pytorch/pytorch/pull/161533 on behalf of https://github.com/atalman due to Multiple internal failures on PR #[161471](https://github.com/pytorch/pytorch/pull/161471) will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161533#issuecomment-3239278635))	2025-08-30 13:51:22 +00:00
FFFrog	b93f87d67b	[OpenReg] Integrate Event&Stream from OpenReg Backend into PyTorch (#160100 ) We integrated the openreg backend’s `Stream` and `Event` into PyTorch, all of which are similar to other accelerators like `CUDA`, `XPUs`, etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160100 Approved by: https://github.com/albanD ghstack dependencies: #161603, #160099, #161773	2025-08-30 13:21:28 +00:00
FFFrog	6284881b2a	[OpenReg] Add tests of device and memory for OpenReg (#161773 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161773 Approved by: https://github.com/albanD ghstack dependencies: #161603, #160099	2025-08-30 13:21:28 +00:00
FFFrog	aae9cbb6c0	[OpenReg] Add Event&Stream Support for OpenReg Backend (#160099 ) Referring to the signatures and functions of `Stream` and `Event` in CUDA, we use CPU multithreading and conditional variables to implement equivalent capabilities as the underlying foundation of torch_openreg. Changes: - Add stream capabilities for OpenReg - Add event capabilities for OpenReg - Add kernel launch entrypoint for OpenReg - Add testcases about stream and event for OpenReg - Add example for OpenReg Pull Request resolved: https://github.com/pytorch/pytorch/pull/160099 Approved by: https://github.com/albanD ghstack dependencies: #161603	2025-08-30 13:21:21 +00:00
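Note on #160099: the idea of building streams and events from CPU threads and condition variables can be sketched as below. This is a hedged Python illustration only. The `Stream` and `Event` classes and their methods (`launch`, `record`, `synchronize`, `close`) are hypothetical names for this sketch and are not the torch_openreg implementation.

```python
import queue
import threading


class Event:
    """Completion flag guarded by a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False

    def _complete(self):
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def synchronize(self):
        # Block the caller until the recording stream reaches this event.
        with self._cond:
            self._cond.wait_for(lambda: self._done)


class Stream:
    """In-order work queue drained by a single worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: shut the worker down
                break
            task()

    def launch(self, fn, *args):
        # Enqueue a "kernel"; it runs after everything queued before it.
        self._tasks.put(lambda: fn(*args))

    def record(self, event):
        # The event completes once all previously queued work has run.
        self._tasks.put(event._complete)

    def close(self):
        self._tasks.put(None)
        self._worker.join()


results = []
stream = Stream()
event = Event()
stream.launch(results.append, 1)
stream.launch(results.append, 2)
stream.record(event)
event.synchronize()
stream.close()
print(results)  # [1, 2]
```

Because the queue is FIFO and has a single consumer, work on one stream stays ordered, which is the property an event relies on when it is recorded.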
FFFrog	dad2e50ac5	[OpenReg] Rename cpu_fallback_blacklist to cpu_fallback_blocklist (#161603 ) As the title stated. Related Infos: https://github.com/pytorch/pytorch/pull/158644#discussion_r2301460839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161603 Approved by: https://github.com/albanD	2025-08-30 13:21:15 +00:00
Aleksandar Samardžić	37da7b777b	Fix _scaled_grouped_mm not reported as unsupported on SM100. (#161780 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161780 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy	2025-08-30 12:33:51 +00:00
xinan.lin	c83cbd2f2a	[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 ) Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142 Approved by: https://github.com/jansel	2025-08-30 11:09:07 +00:00
Mwiza Kunda	b994f6e3b3	[inductor] check block options after broadcasting and singleton dims have been removed (#161602 ) This will allow for some more cases to use tensor descriptors e.g. before the following block params would not match because the innermost dimension does not have stride 1 ```python block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0]) ``` After broadcasting dimensions and singleton dimensions are removed: ```python block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602 Approved by: https://github.com/jansel	2025-08-30 08:10:51 +00:00
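Note on #161602: the effect of dropping broadcast and singleton dimensions before checking the block options can be illustrated on the shapes and strides quoted above. This is a simplified sketch, not Inductor's code. The helper names `drop_broadcast_and_singleton_dims` and `innermost_has_unit_stride` are hypothetical, plain integers stand in for the symbolic expressions, and "broadcast" is assumed to mean stride 0.

```python
def drop_broadcast_and_singleton_dims(shape, strides):
    """Keep only dimensions that are neither size 1 nor stride 0 (broadcast)."""
    kept = [
        (size, stride)
        for size, stride in zip(shape, strides)
        if size != 1 and stride != 0
    ]
    return [size for size, _ in kept], [stride for _, stride in kept]


def innermost_has_unit_stride(strides):
    """The condition the commit message says failed before the change."""
    return bool(strides) and strides[-1] == 1


shape, strides = [64, 4, 1, 1], [0, 1, 0, 0]
before = innermost_has_unit_stride(strides)

new_shape, new_strides = drop_broadcast_and_singleton_dims(shape, strides)
after = innermost_has_unit_stride(new_strides)

print(new_shape, new_strides, before, after)  # [4] [1] False True
```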
yucai-intel	f44ad54bc6	Update torch-xpu-ops commit pin (#161152 ) Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](`8b58040ee3`), includes: - Add vectorization path on maxpool forward channel last - Add FlightRecorder support for ProcessGroupXCCL - Fix random build failure on codegen - Suppress dllexport warning on Windows - Make torch-xpu-ops build depend on ATen XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152 Approved by: https://github.com/EikanWang Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-08-30 07:19:24 +00:00
Scott Wolchok	4d3ab2669b	Stop trying to intern arguments in PyObject_FastGetAttrString (#161432 ) If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see `b222eb66c7 (diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163)`) Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161432 Approved by: https://github.com/ezyang ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329	2025-08-30 06:55:43 +00:00
Scott Wolchok	0ee8a4e281	Fix accidental copy in pushPyOutToStack (#161329 ) `auto` forces a copy. Confirmed this did something noticeable with perf. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161329 Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328	2025-08-30 06:55:43 +00:00
Scott Wolchok	eb9526ae35	Avoid double hash lookup in torch._library.simple_registry (#161328 ) Not a huge cost, but free win is free. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161328 Approved by: https://github.com/Skylion007 ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317	2025-08-30 06:55:43 +00:00
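Note on #161328: the commit message does not show the change itself, so the following is only the general pattern of a double hash lookup and its single-lookup equivalent. The `registry` dict and both function names are hypothetical and are not taken from `torch._library.simple_registry`.

```python
registry = {"aten::add": "entry-for-add"}


def find_double_lookup(name):
    # Two hash lookups on a hit: one for `in`, one for the subscript.
    if name in registry:
        return registry[name]
    return None


def find_single_lookup(name):
    # One hash lookup: dict.get returns None when the key is absent.
    return registry.get(name)


hit = find_single_lookup("aten::add")
miss = find_single_lookup("aten::mul")
print(hit, miss)  # entry-for-add None
```

When a missing key should also be inserted, `dict.setdefault` gives the same single-lookup behavior.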
Scott Wolchok	302d860157	Improve assert perf in _python_dispatch._correct_storage_aliasing (#161317 ) This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161317 Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308, #161315	2025-08-30 06:55:42 +00:00
Scott Wolchok	0c459f2921	Fix pybind enum efficiency issue in return_and_correct_aliasing (#161315 ) Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation. This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161315 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308	2025-08-30 06:55:42 +00:00
Scott Wolchok	b96bcb9fdb	Optimize _python_dispatch.return_and_correct_aliasing.get_write_alias (#161308 ) - Empty containers are Falsey - Hoist cheap checks first - Microbenchmarked single-element set access method Benchmark code: ``` import timeit to_test = [ ('list(x)', 'x = set([3])'), ('x[0]', 'x = [3]'), ('list(x)[0]', 'x = set([3])'), ('next(iter(x))', 'x = set([3])'), ] for (stmt, setup) in to_test: res = timeit.timeit(stmt=stmt, setup=setup) print(f"Time for `{stmt}`: {res}") ``` Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux) ``` Time for `list(x)`: 0.03418 Time for `x[0]`: 0.00852 Time for `list(x)[0]`: 0.03561 Time for `next(iter(x))`: 0.02278 ``` FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark! Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304	2025-08-30 06:55:42 +00:00
Scott Wolchok	2089ed3d5e	Use `is`, not ==, to check exact type matches in _python_dispatch (#161304 ) `is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161304 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301, #161292	2025-08-30 06:55:42 +00:00
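Note on #161304: a small, self-contained illustration of why `is` suits exact type checks. The classes `MyInt`, `AlwaysEqualMeta`, and `Weird` are made up for this sketch and are unrelated to the `_python_dispatch` code.

```python
class MyInt(int):
    """Subclass of int: isinstance accepts it, an exact type check does not."""


class AlwaysEqualMeta(type):
    """Metaclass whose == is overridden, to show that == can be customized."""

    def __eq__(cls, other):
        return True

    __hash__ = type.__hash__  # keep classes hashable despite defining __eq__


class Weird(metaclass=AlwaysEqualMeta):
    pass


exact_plain = type(3) is int             # exact match
exact_subclass = type(MyInt(3)) is int   # subclass is not an exact match
loose_subclass = isinstance(MyInt(3), int)

eq_fooled = type(Weird()) == int         # goes through AlwaysEqualMeta.__eq__
is_not_fooled = type(Weird()) is int     # identity cannot be overridden

print(exact_plain, exact_subclass, loose_subclass, eq_fooled, is_not_fooled)
# True False True True False
```

`is` compares object identity directly, so it avoids the `__eq__` dispatch that `==` performs.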
Scott Wolchok	1a64bf2636	Stop accessing func._schema in _python_dispatch.correct_storage_aliasing (#161292 ) func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161292 Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301	2025-08-30 06:55:42 +00:00
Scott Wolchok	5d35b49ba7	Fix forced copying def_property_readonly for FunctionSchema & friends (#161301 ) This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161301 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab	2025-08-30 06:55:42 +00:00
CaoE	db622842bc	[Inductor][CPP] Optimize config selecting for micro gemm when number of mxn blocks can not occupy all the threads (#161144 ) If the number of mxn blocks cannot occupy all the threads, using a smaller register block size gives better performance, since the compute size per thread is smaller. It may yield a ~20% performance improvement for the real case `m1_n512_k4096`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161144 Approved by: https://github.com/leslie-fang-intel	2025-08-30 05:53:49 +00:00
Boyuan Feng	77d8e98e1b	[Inductor] update exp codegen for better precision (#161829 ) Prior to this PR, we have: ``` [Default Behavior] uses `tl.math.exp({x})`: eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592 TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`: eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367 Update inductor to use `tl.extra.libdevice.exp(tmp0)`: eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64) eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593 ``` Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now. Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) since it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`. So `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829 Approved by: https://github.com/jansel	2025-08-30 04:56:51 +00:00
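Note on #161829: the constant `1.4426950408889634` in the `exp2` workaround is log2(e), so `exp2(x * 1.4426950408889634)` computes the same function as `exp(x)` up to rounding. The sketch below checks that identity on the CPU with Python's `math` module in double precision. It does not reproduce the GPU precision or latency numbers quoted above, and `exp_via_exp2` is a name chosen for this sketch.

```python
import math

LOG2_E = 1.4426950408889634  # log2(e), the constant from the commit message


def exp_via_exp2(x):
    """Compute e**x as 2**(x * log2(e))."""
    return 2.0 ** (x * LOG2_E)


samples = (-3.0, 0.0, 0.5, 10.0)
all_close = all(
    math.isclose(math.exp(x), exp_via_exp2(x), rel_tol=1e-12) for x in samples
)
print(all_close)
```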
Tianren Gao	2fed4fb464	[FlexAttn] Fix Paged Attention Accuracy via Upper Mask Mod and Prevent Invalid Memory Access (#160861 ) Fixes #159247 Issue 1: Accuracy Problem with Non-Divisible KV Sequences --------------------------------------------------------- ### Background Paged attention in flex decoding produced inaccurate results when the KV sequence length was not divisible by the block size. For example, when `KV_S = 64` and `block_size = 128`, the output didn't match standard attention accuracy. ### Root Cause The current paged attention does not apply an upper mask mod when converting from the logical to the physical mask mod. Instead, it uses a noop_mask by default, which leaves all values unmasked, leading to an accuracy mismatch. Adding an upper mask mod based on the original, actual kv_len (64 in this test case) resolves the issue. ### Solution * Applied proper upper bound masking: Updated all calls to `convert_logical_block_mask` to pass `kv_len` as a tensor with proper shape `[B, KV_S]` to provide the actual batched KV sequence length. The function now correctly applies upper bound checks using the actual KV sequence lengths for each batch ### Files Modified * `torch/nn/attention/experimental/_paged_attention.py`: Added `kv_len` parameter as a tensor to `get_mask_mod` and applied upper mask to the new mask mod. * `test/inductor/test_flex_attention.py`: Fixed all related `kv_len` parameter calls in the tests * `test/inductor/test_flex_decoding.py`: Fixed all related `kv_len` parameter calls in the tests Issue 2: Invalid Memory Access (IMA) in Triton Kernels ------------------------------------------------------ ### Background The Triton kernel for flex attention was experiencing invalid memory access errors when running with compute sanitizers, particularly with short KV sequences and small batch sizes. 
### Root Cause * Kernel launches CTAs (Cooperative Thread Arrays) proportional to GPU's multi-processor count (108 via `SPLIT_KV`) * With small workloads, many CTAs remain idle but still attempt to access `kv_indices` with invalid `indices_idx` values * This caused out-of-bounds memory access violations ### Solution Implemented boundary checks with early exit: 1. Added `MAX_VALID_KV_IDX` parameter in `torch/_inductor/kernel/flex/flex_decoding.py` * Calculate maximum valid KV index based on actual `kv_indices` tensor size and pass it to Triton template 2. Added early exit logic in `torch/_inductor/kernel/flex/templates/flex_decode.py.jinja` * Boundary checks before accessing `kv_indices` in both normal and full blocks * Idle CTAs with invalid `indices_idx` skip computation entirely This prevents invalid memory access while reducing wasted computation on idle thread blocks. Testing & Validation -------------------- ### Accuracy Tests * Added comprehensive test cases covering KV sequences not divisible by block sizes * Verified output matches standard attention for various sequence length combinations ### Sanitizer Results `========= COMPUTE-SANITIZER Starting standalone test_max_autotune... Running test_max_autotune on device: cuda max_autotune config: True test_max_autotune completed successfully! Test passed! ========= ERROR SUMMARY: 0 errors` Before: More than 13720 invalid memory access errors with sanitizers After: Clean execution with 0 errors Both fixes work together to ensure paged attention produces accurate results while running safely without memory access violations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160861 Approved by: https://github.com/BoyuanFeng	2025-08-30 04:50:23 +00:00
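Note on #160861: the upper mask mod from Issue 1 can be sketched in plain Python. This is an illustration under simplifying assumptions, not the code in `_paged_attention.py`: `kv_len` is a per-batch list of integers rather than a tensor, and `with_upper_bound` is a hypothetical helper name. The `(b, h, q_idx, kv_idx)` signature follows the usual flex attention `mask_mod` convention.

```python
def noop_mask(b, h, q_idx, kv_idx):
    """Default mask that leaves every position unmasked."""
    return True


def with_upper_bound(mask_mod, kv_len):
    """Wrap mask_mod so positions at or beyond the real KV length are masked.

    kv_len is assumed here to be a list with one length per batch entry.
    """

    def bounded(b, h, q_idx, kv_idx):
        return kv_idx < kv_len[b] and mask_mod(b, h, q_idx, kv_idx)

    return bounded


# KV_S = 64 with block_size = 128: indices 64..127 are padding inside the block.
kv_len = [64]
mask = with_upper_bound(noop_mask, kv_len)

last_valid = mask(0, 0, 0, 63)
first_padding = mask(0, 0, 0, 64)
print(last_valid, first_padding)  # True False
```

With only `noop_mask`, index 64 would be attended to even though it lies past the real sequence, which is the accuracy mismatch described above.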
PyTorch UpdateBot	76f81b56d3	[audio hash update] update the pinned audio hash (#161836 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161836 Approved by: https://github.com/pytorchbot	2025-08-30 04:23:04 +00:00
Howard Huang	82d2d23e85	Add batch option for send/recv_object_list (#160342 ) `send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops, which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized. This adds an option `use_batch`, which calls the send/recv with `batch_isend_irecv` and so reuses the communicators already initialized for collectives in the group. --- Batched P2P ops create (or reuse) a communicator keyed by device index. Regular P2P ops create (or reuse) dedicated 2-rank communicators keyed by “rank1:rank2”. See: `c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342 Approved by: https://github.com/wconstab	2025-08-30 03:29:09 +00:00
PyTorch MergeBot	e015de1969	Revert "Use vectorized stores for all dtypes (#161649 )" This reverts commit f0a517e333d6204f560d8061a4f70523060c93bf. Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to buggy ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3238895967))	2025-08-30 03:13:40 +00:00
Nikita Shulga	0af56fc33e	Cleanup stale submodule directories after checkout (#161748 ) Fixes https://github.com/pytorch/pytorch/issues/161510 Test plan: ``` % cd third_party/kineto % git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104) HEAD is now at fe80f93 Fix MSVC Error (#1134) Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' % git checkout 5e75018; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was fe80f93 Fix MSVC Error (#1134) HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104) warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850' Submodule path 'libkineto/third_party/fmt': checked out 
'0041a40c1350ba702d475b9c4ad62da77caea164' Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347' % cd ../.. % git status HEAD detached from 649e397c6de Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) (commit or discard the untracked or modified content in submodules) modified: third_party/kineto (untracked content) % time git submodule foreach --recursive git clean -ffdx ... git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total % git status HEAD detached from 649e397c6de ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748 Approved by: https://github.com/atalman	2025-08-30 01:30:44 +00:00
Irakli Salia	8627a19adf	[MPS] sparse add unary funcs + add for sparse tensors (#160839 ) Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-30 01:09:00 +00:00
eellison	ebfee60101	[WIP] more aggressive persistent reduction (#161055 ) Gives an 18% speedup on RMS norm (2048, 32768). We have also seen other instances where Inductor is not aggressive enough about generating persistent reductions, e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335). Generating persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option within looped reductions by setting RBLOCK == rnumel, so that we can still fall back to looped reductions as needed. The criteria are: - there must be significant memory savings from doing a persistent reduction (by keeping data in registers and avoiding another iteration over the input) - we should not be coalescing on the x dimension, otherwise a large RBLOCK will inhibit coalescing - the kernel should not be especially register- or arithmetic-intensive (this last check uses mem_ops_per_thread, but could be improved). A dashboard run is still needed, although I'm not sure we get many large RBLOCK cases in our benchmarks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055 Approved by: https://github.com/jansel	2025-08-30 01:08:45 +00:00
PyTorch MergeBot	6db872fa2c	Revert "Cleanup stale submodule directories after checkout (#161748 )" This reverts commit 0e45023cf9cbe1cf18279c1b0d391ea9464e7731. Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I still see the same failures, and could not tell from the log whether those checks are running or not ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3238791895))	2025-08-30 01:04:11 +00:00
Nikita Shulga	7c30a9d7fc	[MPS] Add slow version of `kthvalue` (#161817 ) Which heavily borrows implementation logic from `topk` As this method is non-deterministic, modified the logic for cpu-ops indices comparison with just an equality statement, as by default random numbers picked for input tensor allow for quite a lot of overlaps Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817 Approved by: https://github.com/dcci	2025-08-30 00:44:29 +00:00
Chien-Chin Huang	c1e504ec2f	[SymmMEM] Move AsyncTP tests to a separate test class (#161820 ) We move the AsyncTP tests to a separate test suite because 1) Async TP ops are not core symmetric memory APIs; they are more like applications, and 2) MultiProcContinuousTest skips all following tests if a test fails (we should fix this too). We still want test signals for the core symmetric memory APIs when Async TP ops fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820 Approved by: https://github.com/kwen2501	2025-08-30 00:40:40 +00:00
Parshant Sharma	4ad9fbc83a	Unify TypeAlias definitions in optimizer.py (#161493 ) Fixes #160834 This PR unifies the TypeAlias definitions in [optimizer.py](https://github.com/pytorch/pytorch/blob/main/torch/optim/optimizer.py) This ensures the following: - Consistency and standardization - Enhanced IDE support - Less runtime confusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/161493 Approved by: https://github.com/Skylion007	2025-08-30 00:35:02 +00:00
Wang, Chuanqi	0f81e7f640	[CI] Fix XPU ci test permission issue (#161389 ) Needed because of the new test runners; see https://github.com/pytorch/pytorch/actions/runs/17161094208/job/48694776064#step:2:124 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161389 Approved by: https://github.com/atalman	2025-08-30 00:03:59 +00:00
Isalia20	3daf20f8e1	[MPS] fix empty input in posneg functions (#161824 ) fix empty posneg function for mps: ```python import torch input_tensor = torch.empty(0, device="mps") out_pos = torch.isposinf(input_tensor) ``` Gives: ``` RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/Irakli_Salia/Desktop/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":551, please report a bug to PyTorch. Placeholder tensor is empty! ``` on main branch Pull Request resolved: https://github.com/pytorch/pytorch/pull/161824 Approved by: https://github.com/malfet	2025-08-29 23:12:04 +00:00
Zhang, Liangang	3e459491b5	Enable XPU path for FlexAttention (#143553 ) [#RFC153024](https://github.com/pytorch/pytorch/issues/153024) Motivation 1. Attention is the critical performance bottleneck in current LLMs, and FlexAttention is a good choice for covering the broad range of variants in transformer-series models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on XPU devices. It also provides a candidate for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on XPU devices. 2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flexattention kernel and a flexdecoding kernel to cover compute-bound and memory-bound GEMM computation, and different shapes should also be supported to serve LLM inference, e.g. head_dim=64, 96, 128, 256. What does this PR do? 1. Enable the XPU device type for the FlexAttention kernel and UTs, ensuring all important UTs pass on XPU devices. 2. For E2E model inference, ensure that LLM inference with FlexAttention works. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553 Approved by: https://github.com/EikanWang, https://github.com/drisspg Co-authored-by: Mao Yunfei <yunfei.mao@intel.com> Co-authored-by: Xingyuan Li <xingyuan.li@intel.com> Co-authored-by: majing <jing1.ma@intel.com> Co-authored-by: Xiao, Wang <wang.xiao@intel.com>	2025-08-29 23:10:58 +00:00
Andrey Talman	0e2c8af5a6	[CI/CD] Windows set git config --global core.ignorecase false (#161813 ) Make sure git on Windows has core.ignorecase set to false Pull Request resolved: https://github.com/pytorch/pytorch/pull/161813 Approved by: https://github.com/malfet	2025-08-29 23:04:43 +00:00
Ruben Rodriguez Buchillon	ea27464a79	[inductor][decompose k] disable on everything other than cuda (#161795 ) # why - untested so far # what - add an empty config heuristic for all devices for decompose k - the cuda heuristic, because it is more specific, will still be picked up - add notes explaining how to enable on other devices # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "decompose_k" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161795 Approved by: https://github.com/PaulZhang12 ghstack dependencies: #161767	2025-08-29 22:41:27 +00:00
Ruben Rodriguez Buchillon	45eccf414f	[inductor][heuristics registry] missing heuristic is not an error anymore, cross device heuristics (#161767 ) # why - not having a heuristic is an error but should not crash, just provide 0 configs - some heuristics are cross device type - cleaner to be explicit about being cross device type than having to enumerate every possible device type # what - on registration, supply device_type=None (explicitly) to say this heuristic is cross device - test to guard the heuristics hierarchies # testing ``` python3 -bb -m pytest test/inductor/test_template_heuristics_registry.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161767 Approved by: https://github.com/PaulZhang12	2025-08-29 22:41:27 +00:00
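The lookup behaviour described in the two `[inductor]` heuristics commits above (a missing heuristic yields zero configs, `device_type=None` marks a cross-device heuristic, and a more specific device registration wins) can be sketched roughly as follows. This is a toy registry and not Inductor's actual code: `register`, `lookup`, the `"decompose_k"` key, and the `k_split` config values are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Hypothetical registry mapping (template name, device type) to a config generator.
# device_type=None marks a heuristic as cross-device.
_REGISTRY: dict[tuple[str, Optional[str]], Callable[[], list[dict]]] = {}


def register(template: str, device_type: Optional[str]):
    """Decorator that records a config generator under (template, device_type)."""
    def decorator(fn):
        _REGISTRY[(template, device_type)] = fn
        return fn
    return decorator


def lookup(template: str, device_type: str) -> list[dict]:
    """Prefer the device-specific heuristic, then the cross-device one.

    A missing heuristic is not an error: it simply yields zero configs.
    """
    for key in ((template, device_type), (template, None)):
        if key in _REGISTRY:
            return _REGISTRY[key]()
    return []


@register("decompose_k", None)
def _no_configs():
    # Cross-device default: offer no configs, which effectively disables the template.
    return []


@register("decompose_k", "cuda")
def _cuda_configs():
    # More specific registration, so it is picked on CUDA (values are made up).
    return [{"k_split": 32}, {"k_split": 64}]
```

With this layout, `lookup("decompose_k", "cuda")` returns the CUDA configs, while any other device type falls through to the empty cross-device default.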
Wang, Chuanqi	037f3bd475	[CI] Migrate XPU build and test to python 3.10 (#161708 ) Follow #161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708 Approved by: https://github.com/malfet	2025-08-29 22:31:39 +00:00
PyTorch MergeBot	6e548c1a87	Revert "[CI] Migrate XPU build and test to python 3.10 (#161708 )" This reverts commit 2a70d98abf8256d3d768eff028fca20198579824. Reverted https://github.com/pytorch/pytorch/pull/161708 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing rocm jobs to fail. See: test/inductor/test_max_autotune.py::TestMaxAutotuneSubproc::test_max_autotune_addmm_search_space_EXHAUSTIVE_dynamic_True [GH job link](https://github.com/pytorch/pytorch/actions/runs/17303310877/job/49125664617) [HUD commit link](`2a70d98abf`) ([comment](https://github.com/pytorch/pytorch/pull/161708#issuecomment-3238359944))	2025-08-29 21:49:15 +00:00
zhxchen17	eb78757708	[inductor] Lift fw_compiler and bw_compiler as toplevel functions. (#161762 ) This is a no-op refactor to compiler_fx which lifts the logic of fw_compiler and bw_compiler to toplevel, so that they can be reused in a different stack (e.g. precompile). Differential Revision: [D81292968](https://our.internmc.facebook.com/intern/diff/D81292968/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161762 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2025-08-29 21:46:55 +00:00
David Berard	05eeb29976	[inductor][triton] support JITCallable._hash_lock (#161768 ) Fixes #161618 Triton # 7974 introduces a threading.RLock() in JITCallable, which is not pickle-able. This PR adds this field to the list of un-pickleable fields that need to be handled specially. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161768 Approved by: https://github.com/xuzhao9	2025-08-29 21:20:02 +00:00
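As a rough illustration of why a `threading.RLock()` attribute needs special handling when pickling, the sketch below drops the lock from the pickled state and recreates it on load. `JitLike` is a hypothetical stand-in class, not Triton's `JITCallable`, and the commit's actual handling inside Inductor may differ.

```python
import pickle
import threading


class JitLike:
    """Hypothetical stand-in for an object that carries a lock attribute."""

    # Attributes that cannot be pickled and must be rebuilt after loading.
    _UNPICKLEABLE = ("_hash_lock",)

    def __init__(self, src: str):
        self.src = src
        self._hash_lock = threading.RLock()

    def __getstate__(self):
        # Copy the instance dict and drop the fields pickle cannot handle.
        state = self.__dict__.copy()
        for name in self._UNPICKLEABLE:
            state.pop(name, None)
        return state

    def __setstate__(self, state):
        # Restore the picklable fields, then create a fresh lock.
        self.__dict__.update(state)
        self._hash_lock = threading.RLock()


restored = pickle.loads(pickle.dumps(JitLike("kernel source")))
```

Without the `__getstate__` override, `pickle.dumps` would raise a `TypeError` because the lock object itself cannot be pickled.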
Tristan T	18b4fdde8f	Add MTIA to floor_divide op (#161575 ) Summary: Missed file in op registration resulting in fallback during test Reviewed By: andyanwang, srsuryadev Differential Revision: D81085615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161575 Approved by: https://github.com/albanD, https://github.com/malfet	2025-08-29 20:39:29 +00:00
PyTorch MergeBot	f6368e934e	Revert "[MPS] sparse add unary funcs + add for sparse tensors (#160839 )" This reverts commit 93c5112f46a978a029644ae599979416ead5c917. Reverted https://github.com/pytorch/pytorch/pull/160839 on behalf of https://github.com/atalman due to test_sparse_csr.py::TestSparseCompressedCPU::test_consistency_SparseCSR_asinh_cpu_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17329155095/job/49201551217) [HUD commit link](`93c5112f46`) ([comment](https://github.com/pytorch/pytorch/pull/160839#issuecomment-3238093296))	2025-08-29 19:55:39 +00:00
Yidi Wu	bf6aaba0f7	[while_loop] avoid aliasing when body_fn never executes (#160670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160670 Approved by: https://github.com/zou3519 ghstack dependencies: #160548, #160669	2025-08-29 19:36:37 +00:00
Yidi Wu	456493f7ed	[while_loop][inductor] remove offset check for while_loop (#160669 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160669 Approved by: https://github.com/zou3519 ghstack dependencies: #160548	2025-08-29 19:36:37 +00:00
Huy Do	c74e301455	Bump TorchBench version (#161461 ) To include the latest fixes from TorchBench. I'll setup a nightly commit hash update for this next Pull Request resolved: https://github.com/pytorch/pytorch/pull/161461 Approved by: https://github.com/malfet	2025-08-29 19:21:07 +00:00
Scott Wolchok	67457dbb9d	Fix non-const reference arguments in torch/csrc/jit/python/init.cpp (#161300 ) Shouldn't be any generated code impact, just fixing bad practice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161300 Approved by: https://github.com/wconstab, https://github.com/malfet ghstack dependencies: #161286	2025-08-29 19:01:32 +00:00
Natalia Gimelshein	e9bbd28f22	make einsum produce contiguous inputs in more cases (#161755 ) Fixes #161729 Written by codex This won't produce contiguous inputs for all einsum applications, because we flatten all right-only and left-only dimensions, so if right and left operand dimensions are interleaved in output, we cannot (with current algo) produce contiguous output, however, for common cases like in the linked issue it works. Let's see what CI says Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755 Approved by: https://github.com/malfet, https://github.com/albanD	2025-08-29 18:50:46 +00:00
PaulZhang12	348d781055	[Inductor] Update Outer Reduction Heuristic (#159093 ) Update outer reduction heuristics for significant speedups. HuggingFace: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" /> Average ~20% speedup on a kernel by kernel basis TorchBench: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" /> Average ~40% speedup on a kernel by kernel basis <img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093 Approved by: https://github.com/jansel	2025-08-29 18:31:22 +00:00
Ting Lu	303f514d5b	[CI] Add basic CUDA 13.0 periodic test (#161013 ) https://github.com/pytorch/pytorch/issues/159779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161013 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com> Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>	2025-08-29 17:56:33 +00:00
Xu Han	f532f99822	[AOTI] normalize_path_separator zip file path (#161781 ) normalize_path_separator zip file path Pull Request resolved: https://github.com/pytorch/pytorch/pull/161781 Approved by: https://github.com/angelayi	2025-08-29 17:53:41 +00:00
Irakli Salia	93c5112f46	[MPS] sparse add unary funcs + add for sparse tensors (#160839 ) Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-29 16:28:58 +00:00
Mwiza Kunda	0f6a08a029	[inductor] Fix SubgraphInfo round trip (#161779 ) Currently `numels` is not specific to a created subgraph since it is not retrieved by `dataclasses.fields(SubgraphInfo)` due to it not being type annotated, see [ref](https://docs.python.org/3/library/dataclasses.html#module-dataclasses:~:text=The%20%40dataclass%20decorator%20examines%20the%20class%20to%20find%20fields.%20A%20field%20is%20defined%20as%20a%20class%20variable%20that%20has%20a%20type%20annotation.%20With%20two%20exceptions%20described%20below%2C%20nothing%20in%20%40dataclass%20examines%20the%20type%20specified%20in%20the%20variable%20annotation.). So for example the following would happen: ``` self.numels = {"x": sympy.Integer(5)} subgraph_name = "<x>" with self.create_subgraph_body(subgraph_name): self.numels = {"x": sympy.Integer(7)} # this would print that x has size 7, not the original value of 5 print(self.numels) # numels would be None because dataclasses.fields(SubgraphInfo) does not include numels # since it is not type annotated print(self.subgraph_bodies[subgraph_name]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161779 Approved by: https://github.com/eellison	2025-08-29 16:27:29 +00:00
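The dataclass behaviour behind the SubgraphInfo fix above can be reproduced in isolation. The classes below are hypothetical minimal examples (not Inductor's `SubgraphInfo`); they only show that `dataclasses.fields` skips class attributes that lack a type annotation.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class Unannotated:
    body: str = ""   # annotated, so it is a dataclass field
    numels = None    # no annotation: a plain class attribute, not a field


@dataclasses.dataclass
class Annotated:
    body: str = ""
    numels: Optional[dict] = None  # annotated, so it is now a field


unannotated_fields = [f.name for f in dataclasses.fields(Unannotated)]
annotated_fields = [f.name for f in dataclasses.fields(Annotated)]
print(unannotated_fields)  # ['body']
print(annotated_fields)    # ['body', 'numels']
```

Any save/restore logic that iterates over `dataclasses.fields(...)` therefore never sees the unannotated attribute.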
Zain Rizvi	c8fa907e74	Check commit order (#161560 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161560 Approved by: https://github.com/malfet ghstack dependencies: #161558, #161637	2025-08-29 16:22:58 +00:00
ILCSFNO	b99a112688	Update optional tag for `interpolation` in `torch.quantile()` (#161706 ) Fixes #146156 Refix the issue with the extra needed fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161706 Approved by: https://github.com/soulitzer	2025-08-29 16:21:14 +00:00
Chien-Chin Huang	cd6d63f453	[SymmMEM] Fix test_empty_strided_p2p_persistent (#161677 ) test_empty_strided_p2p_persistent allocates persistent symmetric memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test. https://github.com/pytorch/pytorch/pull/161668 should also fix the issue, but we can land this PR for a safer test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677 Approved by: https://github.com/kwen2501 ghstack dependencies: #161676	2025-08-29 16:11:58 +00:00
Nikita Shulga	0e45023cf9	Cleanup stale submodule directories after checkout (#161748 ) Fixes https://github.com/pytorch/pytorch/issues/161510 Test plan: ``` % cd third_party/kineto % git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104) HEAD is now at fe80f93 Fix MSVC Error (#1134) Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' % git checkout 5e75018; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was fe80f93 Fix MSVC Error (#1134) HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104) warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850' Submodule path 'libkineto/third_party/fmt': checked out 
'0041a40c1350ba702d475b9c4ad62da77caea164' Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347' % cd ../.. % git status HEAD detached from 649e397c6de Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) (commit or discard the untracked or modified content in submodules) modified: third_party/kineto (untracked content) % time git submodule foreach --recursive git clean -ffdx ... git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total % git status HEAD detached from 649e397c6de ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748 Approved by: https://github.com/atalman	2025-08-29 14:07:06 +00:00
PyTorch MergeBot	823a329984	Revert "Cleanup stale submodule directories in checkout action (#161748 )" This reverts commit f3c5a82139539c63e6f08966e268c4160e138320. Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I put the check in the wrong place ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3237080419))	2025-08-29 13:40:21 +00:00
Ankita George	f0a65cd6d6	Add pg argument to consolidate_safetensors_files_on_every_rank (#161421 ) Summary: Based on feedback on https://github.com/pytorch/torchtitan/pull/1625, adding a pg argument to consolidate_safetensors_files_on_every_rank so that we don't infer the pg and users can supply one if needed. Test Plan: ensure existing tests pass Rollback Plan: Differential Revision: D80954339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161421 Approved by: https://github.com/fegin	2025-08-29 13:31:11 +00:00
Xilun Wu	627decb0ed	[DTensor] fix DTensorTestCase.destroy_pg() when device_type is "cpu" but CUDA device is available (#161015 ) Summary When `device_id` is not None, barrier() will choose the highest-priority accelerator, which means that if the test specifies CPU for testing while CUDA is available on the host, barrier() will use CUDA. To avoid this and better respect `self.device_type`, we add this branch to force barrier() to use CPU when `self.device_type` is CPU and another accelerator is also available. Test `pytest test/distributed/tensor/test_dtensor_testbase.py` Debugging Output ``` # from init_process_group() init pg: backend=gloo, device_id = None default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')] # from barrier() barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015 Approved by: https://github.com/tianyu-l	2025-08-29 12:47:11 +00:00
zeshengzong	448a7e7e31	Fix `SequentialLR` deprecate warning about invoke `step(epoch)` (#149392 ) Fixes #116776 #76113 #113222 #67958 ## Changes - Refactor `LRScheduler.step` method, leave `epoch` check logic in public method `step` - Move update `lr` logic to `_update_lr` method - Make `SequentialLR` use `_update_lr` to avoid unnecessary warning message ## Test Result ```bash pytest test/optim/test_lrscheduler.py -vv ``` ![image](https://github.com/user-attachments/assets/e1c5527e-193e-4328-bf95-023139ea0416) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149392 Approved by: https://github.com/janeyx99	2025-08-29 11:45:11 +00:00
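The refactor described in the `SequentialLR` commit above (keep the `epoch` deprecation check in the public `step`, move the learning-rate update into `_update_lr`, and have the sequential wrapper call `_update_lr` directly) can be sketched with toy classes. `ToyScheduler` and `ToySequential` are hypothetical illustrations; they are not `torch.optim.lr_scheduler` code, and the warning text is made up.

```python
import bisect
import warnings


class ToyScheduler:
    """Exponential decay: lr = base_lr * gamma ** last_epoch."""

    def __init__(self, base_lr: float, gamma: float = 0.5):
        self.base_lr = base_lr
        self.gamma = gamma
        self.last_epoch = 0
        self.lr = base_lr

    def step(self, epoch=None):
        # Public entry point: the deprecation check lives only here.
        if epoch is not None:
            warnings.warn("passing epoch to step() is deprecated", DeprecationWarning)
        self._update_lr(epoch)

    def _update_lr(self, epoch=None):
        # Internal update without any user-facing warning.
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
        self.lr = self.base_lr * self.gamma ** self.last_epoch


class ToySequential:
    """Runs child schedulers one after another, switching at the milestones."""

    def __init__(self, schedulers, milestones):
        self.schedulers = schedulers
        self.milestones = milestones
        self.last_epoch = 0

    def step(self):
        self.last_epoch += 1
        idx = bisect.bisect_right(self.milestones, self.last_epoch)
        child = self.schedulers[idx]
        if idx > 0 and self.milestones[idx - 1] == self.last_epoch:
            child._update_lr(0)  # restart the child at its epoch 0, no warning
        else:
            child._update_lr()
```

Because the wrapper bypasses the public `step(epoch)`, resetting a child scheduler at a milestone no longer triggers the deprecation warning, while direct user calls to `step(epoch)` still do.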
Malay Bag	ed370ae4b0	[unflatten] Fix test by supporting both MappingKey and GetAttrKey (#161599 ) Summary: As title Test Plan: Run internal tests Rollback Plan: Differential Revision: D81115712 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161599 Approved by: https://github.com/tugsbayasgalan	2025-08-29 10:08:38 +00:00
David Berard	5859edf113	[BE][inductor] replace "and" -> "logical_and" in bucketize_binary_search (#160941 ) Get rid of these warnings: ``` /home/dberard/local/pytorch-env7/pytorch/torch/_inductor/runtime/triton_helpers.py:317: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '\|' instead ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160941 Approved by: https://github.com/malfet, https://github.com/jingsh	2025-08-29 09:27:13 +00:00
xinan.lin	5b701a6bb2	[AOTI][Intel GPU] Add XPU quantization ops to AOT Inductor. (#156572 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156572 Approved by: https://github.com/EikanWang, https://github.com/angelayi ghstack dependencies: #157430	2025-08-29 09:19:44 +00:00
xinan.lin	48679ef966	[Refactor][XPU] Refactor XPU quantization op and add header files. (#157430 ) This PR refactors the XPU quantization ops to align their code structure with the CPU implementation for consistency. It also adds necessary header files to enable future integration with AOTI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157430 Approved by: https://github.com/angelayi	2025-08-29 09:19:44 +00:00
Natalia Gimelshein	0ca3a6085d	use host+device_id to make sure devices are unique in rendezvous request (#161756 ) Per title, for NVL72 systems where devices with the same indices on multiple hosts are within the same nvlink domain Pull Request resolved: https://github.com/pytorch/pytorch/pull/161756 Approved by: https://github.com/kwen2501	2025-08-29 09:09:45 +00:00
Yiming Zhou	a55d2beb50	[export] Support complex constant in serde (#161517 ) Summary: Fixes #160749 For a model like ``` class M(torch.nn.Module): def forward(self, x): s = torch.sin(x) z = 1j * s return z ``` Its graph will be ``` graph(): %x : [num_users=1] = placeholder[target=x] %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%x,), kwargs = {}) %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%sin, 1j), kwargs = {}) return (mul,) ``` `1j` will appear as a constant complex argument in the `aten.mul` Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_complex_constant Rollback Plan: Differential Revision: D80672323 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161517 Approved by: https://github.com/angelayi	2025-08-29 08:13:21 +00:00
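To illustrate why a complex constant such as `1j` needs explicit support in a serializer, here is a minimal sketch of one possible encoding. The `{"__complex__": [real, imag]}` layout and the helper names `encode_arg` / `decode_arg` are assumptions made for this example; they are not the schema that `torch.export` actually uses.

```python
import json


def encode_arg(value):
    """Encode one argument; complex numbers become a tagged [real, imag] pair."""
    if isinstance(value, complex):
        return {"__complex__": [value.real, value.imag]}
    return value  # ints, floats, strings, etc. pass through unchanged


def decode_arg(obj):
    """Inverse of encode_arg."""
    if isinstance(obj, dict) and "__complex__" in obj:
        real, imag = obj["__complex__"]
        return complex(real, imag)
    return obj


# JSON has no complex type, so the constant must be encoded before dumping.
payload = json.dumps([encode_arg(a) for a in ("sin", 1j)])
args = [decode_arg(o) for o in json.loads(payload)]
```

Calling `json.dumps(1j)` directly raises a `TypeError`, which is the kind of gap a complex-constant case in a serialization schema has to close.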
Chien-Chin Huang	d8a0bdb0d3	[BE][SymmMEM] Change Optional to the shorthand expression for symmetric memory modules (#161676 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161676 Approved by: https://github.com/Skylion007	2025-08-29 07:31:16 +00:00
PyTorch UpdateBot	a7c949089a	[vllm hash update] update the pinned vllm hash (#161752 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161752 Approved by: https://github.com/pytorchbot	2025-08-29 04:54:31 +00:00
PyTorch UpdateBot	a6456bfa85	[audio hash update] update the pinned audio hash (#161753 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161753 Approved by: https://github.com/pytorchbot	2025-08-29 04:52:58 +00:00
Nikita Shulga	f3c5a82139	Cleanup stale submodule directories in checkout action (#161748 ) Fixes https://github.com/pytorch/pytorch/issues/161510 Test plan: ``` % cd third_party/kineto % git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104) HEAD is now at fe80f93 Fix MSVC Error (#1134) Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' % git checkout 5e75018; git submodule update --init --recursive M libkineto/third_party/dynolog M libkineto/third_party/fmt M libkineto/third_party/googletest Previous HEAD position was fe80f93 Fix MSVC Error (#1134) HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104) warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e' Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850' Submodule path 'libkineto/third_party/fmt': checked 
out '0041a40c1350ba702d475b9c4ad62da77caea164' Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347' % cd ../.. % git status HEAD detached from 649e397c6de Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) (commit or discard the untracked or modified content in submodules) modified: third_party/kineto (untracked content) % time git submodule foreach --recursive git clean -ffdx ... git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total % git status HEAD detached from 649e397c6de ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748 Approved by: https://github.com/atalman	2025-08-29 03:21:31 +00:00
Angela Yi	5c306c3ccb	[fx] Add lru_cache to warning (#161721 ) Summary: Added lru_cache to the warning message to avoid flooding logs Test Plan: CI Rollback Plan: Differential Revision: D81245618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161721 Approved by: https://github.com/pianpwk	2025-08-29 02:25:45 +00:00
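The idea of putting `lru_cache` on a warning helper to avoid flooding logs can be sketched as below. `warn_once` is a hypothetical helper name, not the function changed in the commit; the point is only that the cached function body runs once per distinct message.

```python
import functools
import warnings


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    # The cache remembers each message, so repeated calls with the same
    # text return immediately without emitting the warning again.
    warnings.warn(message, UserWarning, stacklevel=2)


def noisy_loop(n: int) -> None:
    """Calls the helper n times with the same message."""
    for _ in range(n):
        warn_once("this code path is deprecated")
```

A design caveat: the cache is keyed on the message string, so messages that embed varying values (for example a shape or an index) would still be emitted once per distinct value.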
Dylan Maloy	c1cb1cb26e	fix tests caused by has_triton (#161737 ) Summary: This will only call it when we are serializing a Triton HOP. A few tests do unusual mocking that this function doesn't handle, so this prevents it from being called there. Test Plan: att Rollback Plan: Differential Revision: D81261486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161737 Approved by: https://github.com/angelayi	2025-08-29 02:25:35 +00:00
drisspg	5cb1d71e59	[Flex] Fix float16 default config 128 headdim (#161647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161647 Approved by: https://github.com/v0i0	2025-08-29 01:48:06 +00:00
Justin Chu	d153af713e	[ez] Improve formatting in error messages for dynamic shapes (#161573 ) Show the repr of `dim` to make the message more clear. Example: before `but got batch instead`, after `but got "batch" instead` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161573 Approved by: https://github.com/angelayi	2025-08-28 23:52:58 +00:00
PyTorch MergeBot	9b67d8e344	Revert "[RELAND] Close some sources of fake tensor leakage (#161589 )" This reverts commit 5790b009751e6ebba35d3e6d05e7c1b135553eee. Reverted https://github.com/pytorch/pytorch/pull/161589 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/17305150611/job/49128381649) [HUD commit link](`5790b00975`) ([comment](https://github.com/pytorch/pytorch/pull/161589#issuecomment-3235224249))	2025-08-28 23:19:36 +00:00
PyTorch MergeBot	47742081c9	Revert "kill allow_complex_guards_as_runtime_asserts (#160198 )" This reverts commit 69d91b94ba5366f4444d8cb8fd3dab4de4f04d3d. Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/jeffdaily due to let's revert again instead of waiting for forward fix, see earlier comments ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3235165462))	2025-08-28 22:50:37 +00:00
drisspg	fffa62fa12	Ensure large tensor int32 -> int64 indexing is enabled (#157767 ) Fixes: https://github.com/pytorch/pytorch/issues/157446 I think that this delta is worth the switch from block-ptrs, especially since they are deprecated ## Perf Summary A is nightly and B is this diff, so `negative` means this diff improves perf TOP 5 differences <img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" /> <details> <summary><strong>Full perf table (click to expand)</strong></summary> \| attn_type \| dtype \| shape(B,Hq,M,Hkv,N,D) \| TFlops Version A \| TFlops Version B \| \| --- \| --- \| --- \| --- \| --- \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 258.38834144791923 \| 258.6353685004612 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.2192450677751 \| 140.12393320464972 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 122.32683823617003 \| 118.51603755647925 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.48556906165314 \| 137.24259849208627 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 86.59814488695922 \| 84.59431398586257 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 288.52679758135764 \| 292.9174195871856 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 172.25541683643277 \| 172.94326459828508 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 164.40864610599826 \| 165.035129576335 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 176.54876886433945 \| 175.08057670028145 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 125.22491679812626 \| 121.06201152859151 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 339.11952481874283 \| 339.0132835601695 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 227.58583240284406 \| 
228.21824999409597 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 185.98569659868966 \| 182.32850843255093 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 188.9495725191772 \| 180.31385312481657 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 106.25789530994302 \| 106.55084959448476 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 357.6430536888533 \| 363.30843452247274 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 262.3241154406613 \| 265.73250045488 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 249.30498953911416 \| 249.35928192833785 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 224.74126243851808 \| 223.71776504077988 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 168.26977014013707 \| 165.47991483333809 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 382.8178701785897 \| 384.34752965862685 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 308.1449710013853 \| 311.0653716044644 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 251.96365252505072 \| 243.92283557225903 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 226.69316232745368 \| 215.22769268913356 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 153.34142545296405 \| 151.9312673939401 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 396.0998000753126 \| 398.35036286102473 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 333.5198415274966 \| 344.6354466169716 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 310.5955933379696 \| 305.66347819546 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 260.4012412689896 \| 259.758666997307 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 234.13034252182635 \| 227.61676497283614 \| \| noop \| torch.bfloat16 \| 
(2, 16, 8192, 16, 8192, 64) \| 396.17615538477196 \| 401.1419104525502 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 359.98648311998414 \| 360.8285563463094 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 291.97720707257736 \| 281.41694809965253 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 250.1703628419691 \| 238.556760291579 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 199.50782826294306 \| 191.52327358439223 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 411.0632004785396 \| 413.6362648405517 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 382.9404387613185 \| 397.74886235657607 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 357.0998545146633 \| 350.5115200772392 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 281.8033924428203 \| 281.98601309215843 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 282.56595134222135 \| 277.4565795466672 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 408.89838018149516 \| 405.14531386840076 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 396.07662058160264 \| 393.4598228299578 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 317.8822887267849 \| 304.754931401036 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 265.8801304948243 \| 254.22961974295112 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 227.87390579965614 \| 222.19481980110393 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 427.36821778477025 \| 431.3766620314935 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 410.67994346825 \| 423.4666944003808 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 381.1968748374038 \| 381.77668006420424 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 
292.5540046358546 \| 296.5439130720502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 321.04573768858114 \| 310.7423616656888 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 427.46148866769903 \| 426.162091037068 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 419.75580537687347 \| 421.88640120274334 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 337.3208051798903 \| 327.4912454675092 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 276.5638854539581 \| 262.988360558083 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 250.82791326036886 \| 245.07367032501736 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 435.8055824506086 \| 441.8803729460534 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 432.02638235921006 \| 450.33161016596273 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 402.25525939224883 \| 393.8564689669916 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 297.5337286675904 \| 297.0131881135074 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 343.8697037899545 \| 329.8194073407783 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 267.58912366821056 \| 256.91606054118375 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 150.81723692609629 \| 146.32172267858743 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 129.51029293209245 \| 122.72144394093334 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 147.627656359087 \| 141.68956350566188 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 87.55100546003591 \| 84.91293287692788 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 299.5931492743986 \| 305.884253766691 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 179.39026367843837 \| 181.64741311605096 \| \| 
alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 173.93547669282367 \| 173.23972950980564 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 185.90234171599252 \| 182.80844545446686 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 128.08176696266082 \| 123.27722685662111 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 340.50674552770664 \| 338.9071088484576 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 225.4438318650432 \| 230.22899884832975 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 194.15123248528312 \| 185.02793973094865 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 200.74289714108176 \| 191.76606719670647 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 107.03564946728423 \| 106.82432377861258 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 371.31799283918406 \| 379.7555394732925 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 275.97762744310455 \| 276.71106853992995 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 261.6648679783462 \| 259.4127232060398 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 237.03108223577615 \| 233.92710216149527 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 172.13926800371152 \| 168.74390922407585 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 381.50199487767276 \| 383.9043681999597 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 307.9748883093411 \| 312.2403515462001 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 251.11319684705438 \| 243.17870127827277 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 236.3253127246763 \| 223.81250201769552 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 154.55693991756874 \| 153.11360584987685 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 
407.11400078586615 \| 413.53709886086557 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 348.1705797722622 \| 360.09771155957367 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 321.8593280850388 \| 318.2882327401255 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 270.089032013835 \| 268.767323026064 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 238.07324557907788 \| 228.09842078362692 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 399.8172853171901 \| 401.0954526332136 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 363.4387330438581 \| 364.13111024232677 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 294.1752429133857 \| 283.7235663368415 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 256.8389394007649 \| 246.91771015606483 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 199.3378564292656 \| 192.40439590901758 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 425.5150965556111 \| 430.8190098707553 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 396.00437184073013 \| 411.3873625655787 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 369.92803661607815 \| 361.43244467343663 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 293.4277354412933 \| 295.2529537595746 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 288.0208673072841 \| 281.51896404878863 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 408.3005367220567 \| 408.96116482298913 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 396.90095962766304 \| 396.87385456176486 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 319.0534576137999 \| 302.50950358107764 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 270.3334977708081 \| 258.8506349486557 \| \| document_mask \| 
torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 227.46824134365394 \| 222.23759438128766 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 438.24247309479694 \| 437.7975163205371 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 428.34012029699227 \| 433.3215899950434 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 386.52672049728875 \| 388.26216893354984 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 302.71976814728083 \| 302.3574867306459 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 327.39760662780986 \| 308.6348428844912 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 423.31308678262695 \| 426.6306972137279 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 412.6983690923106 \| 419.4961977664297 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 337.41003544742273 \| 324.2155049126126 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 278.7755890910794 \| 265.9194286636502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 251.55678254755364 \| 244.8843180141462 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 452.5930781172308 \| 457.7117122300742 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 445.05676260348116 \| 463.9304535499636 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 415.78302138389415 \| 406.29229555271456 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 308.0311067300895 \| 304.91354721414314 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 351.43943626809335 \| 329.4476923070317 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 295.1801525813241 \| 291.36521287398904 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 183.23250549178067 \| 182.35421238887605 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 
151.56832453117747 \| 151.3422139154794 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 171.02111935180432 \| 160.72516856727913 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 74.05765122783826 \| 74.5885345035243 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 314.3587394591763 \| 319.2938677773619 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 224.57002084153177 \| 225.48868542008177 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.00964804143052 \| 215.39576159953486 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.1174237618258 \| 214.28437413525663 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 121.08920423648368 \| 119.55813661872644 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 362.2193857281911 \| 360.05005804275936 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 279.8840217430121 \| 279.5437918286659 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 227.76617121021982 \| 222.8655938229316 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 215.43141176970562 \| 207.71852284994702 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 121.35588364218539 \| 121.20636565046884 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 365.1545280898012 \| 373.37585444987326 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 304.360119952975 \| 309.1247297936263 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 287.2603904544586 \| 289.25547903162595 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 257.9852675272418 \| 257.59069234098115 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 188.35158496670232 \| 184.24683960154857 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 389.9744911369211 \| 388.43466897254166 \| \| 
causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 345.9228295166513 \| 342.63034895210126 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 279.56334658247437 \| 271.2724375402088 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 245.66477202810066 \| 233.49688207371258 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 170.3270720653187 \| 166.23863845657382 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 400.0041140827554 \| 402.11182445396497 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 363.64641830327434 \| 375.9288663364792 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 341.5776139573363 \| 335.1160003213424 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 281.1811770268521 \| 280.21438270014005 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 247.78716118997716 \| 245.3269825179633 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 403.794126680488 \| 405.2353919019577 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 387.079178426863 \| 385.1461762057035 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 309.7847188173431 \| 298.0443968374749 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 262.4721750159666 \| 250.81679725428586 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 205.70866004479979 \| 202.9620839129557 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 413.380982988662 \| 418.40270594263103 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 398.450064800682 \| 409.6794973994029 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 372.26297458194466 \| 364.44415106552196 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 293.0818569905912 \| 292.85172400643984 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) 
\| 296.46717085592087 \| 285.76362010612763 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 419.3186786037592 \| 426.08801580934437 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 408.1648467766632 \| 409.4122254207817 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 329.24396020457345 \| 313.5200995121138 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 274.61257504571876 \| 255.7801815432177 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 232.63806001220684 \| 230.03020843492314 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 435.0785891054788 \| 440.39101804225345 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 424.86925312752817 \| 435.18898057396825 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 393.000417896268 \| 395.11543361225256 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 297.7755459218185 \| 300.7208114715287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 331.71570861760534 \| 318.07127352552885 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 424.58602747137405 \| 425.84897078470715 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 422.66607285025725 \| 423.5524945535485 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 344.8625760048626 \| 331.6793888458635 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 282.0787281511649 \| 263.7895634445868 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 252.7301927385177 \| 245.41844170037427 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 437.0658069164588 \| 442.9101960063628 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 433.13788271434646 \| 452.3873572709863 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 404.0959191546953 \| 
396.7077863894884 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 300.45502211883206 \| 301.3439134717943 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 344.11003202413934 \| 330.8897663350314 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 298.4364205341705 \| 291.6793556507056 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 187.6382133139633 \| 191.05409897308772 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 156.55822078636112 \| 154.178925976516 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 173.47765221825162 \| 169.30862508068464 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 74.5885345035243 \| 74.52689061607104 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 323.12233826013045 \| 328.53889207933514 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 236.75872140126316 \| 235.8378325547398 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 227.17836523816675 \| 226.75357076139966 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 224.07209453308036 \| 224.07209453308036 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 122.85572156047981 \| 121.11642183704716 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 361.3123326658092 \| 360.71014086458337 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 281.5287983927017 \| 281.94301754758345 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 232.7456696285686 \| 226.50976826432776 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 221.5612361744038 \| 214.96188822837055 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 121.38311528944315 \| 120.85441868178513 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 380.2579019244734 \| 389.2520157863988 \| \| causal \| torch.bfloat16 \| (4, 16, 
2048, 4, 2048, 128) \| 316.95230660496924 \| 317.87597790618906 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 301.07968126657323 \| 298.02424098422983 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 267.2240756921594 \| 267.16353549228154 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 189.82761622494257 \| 186.736450261963 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 389.88665375406805 \| 387.9125133037077 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 348.70619958684887 \| 346.6750499749774 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 280.5472989906087 \| 271.22300822012187 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 250.02397620165968 \| 241.22532776331445 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 171.67817496107645 \| 166.95679280483972 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 412.626880230807 \| 417.60238657950777 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 374.8829313933945 \| 389.4448546468815 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 353.20410434172436 \| 345.7072490717473 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 292.51045924209586 \| 291.66621022138287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 251.6264062063495 \| 248.45110052911542 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 404.0155784550126 \| 401.90546837237514 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 384.4389015599863 \| 386.9684324594344 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 313.3731284132225 \| 298.17074251037894 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 264.19199737284265 \| 252.8982463999916 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 207.03696315185684 \| 202.86697323136772 \| 
\| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 428.2436763312506 \| 433.45005568619536 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 411.8516531869893 \| 428.2753623461049 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 384.9095037182509 \| 372.90888743000744 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 303.2438915629836 \| 302.05095952914337 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 301.8689122735564 \| 285.0363190513223 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 423.13592231504805 \| 420.3991500185611 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 407.44527331585493 \| 408.5064370765247 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 330.50050996167414 \| 316.8763979925965 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 274.6833786307413 \| 259.86098862141324 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 232.24019584158367 \| 226.52040268160232 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 444.4596314237808 \| 455.99558915752266 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 437.4245561244369 \| 455.98275147271966 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 397.3350686877605 \| 397.88875599028063 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 308.53809114394545 \| 307.1359822042007 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 331.32379843423774 \| 316.85293191675646 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 422.4622274366379 \| 425.0407156418684 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 420.9547052783101 \| 430.33779243510276 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 345.50265346504085 \| 332.094855328957 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 
32768, 64) \| 280.81715528243365 \| 264.6543640282054 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 252.25635200421783 \| 245.46235499490305 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 452.5524207341139 \| 461.7512032176736 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 445.2316469907137 \| 464.4523799578466 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 416.87264016717023 \| 409.17124592157046 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 309.42579489389846 \| 307.9734464665731 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 350.50782004300623 \| 330.98959545427294 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767 Approved by: https://github.com/Skylion007	2025-08-28 22:43:59 +00:00
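The commit above concerns index arithmetic on tensors too large for 32-bit offsets. As a rough illustration of when that matters, the sketch below checks whether the largest linear offset into a contiguous tensor fits in a signed 32-bit integer. The helper name `needs_int64_indexing` is hypothetical, and the actual criterion used by the kernels in the PR is not stated in the commit message.

```python
from math import prod

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def needs_int64_indexing(shape):
    """Return True if a contiguous tensor of this shape has element
    offsets that do not fit in a signed 32-bit integer.

    Hypothetical helper for illustration only; it assumes a contiguous
    layout, so the largest offset is numel - 1.
    """
    numel = prod(shape)
    return numel - 1 > INT32_MAX


# One of the benchmark shapes from the table, as (B, H, M, D): fits in int32.
small = (2, 16, 32768, 128)
# A made-up larger shape with 2**32 elements: does not fit.
large = (16, 32, 65536, 128)

print(needs_int64_indexing(small), needs_int64_indexing(large))
```

The benchmark shapes in the table all fit comfortably in 32 bits, which is consistent with the table measuring the cost of the switch rather than exercising the overflow case.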
can-gaa-hou	c0ed87c82d	[Dynamo] Fix weakref.proxy error when `torch.compile` (#161508 ) Fixes #159258 The error occurs when we attempt to create a weak reference from a weak reference proxy. `e9d42b3880/torch/_dynamo/guards.py (L2910-L2915)` In fact, we shouldn't create a weak reference from another reference or proxy, and CPython checks for this. `f60f8225ed/Objects/weakrefobject.c (L410-L418)` However, `__weakrefoffset__` is not equal to 0 when the type of `guarded_object` is in `weakref.ProxyTypes`, so a weak reference is wrongly created for the proxy. I think this could be a bug in CPython, but we can prevent it by adding more weakref type checks here (`weakref.ProxyTypes` contains `weakref.ProxyType` and `weakref.CallableProxyType`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161508 Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/malfet	2025-08-28 22:34:18 +00:00
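The type check described in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the code in `torch/_dynamo/guards.py`: the helper names `is_weakref_like` and `maybe_weakref` are hypothetical, and only `weakref.ref`, `weakref.proxy`, `weakref.ReferenceType` and `weakref.ProxyTypes` are real APIs.

```python
import weakref

# weakref.ProxyTypes is the tuple (ProxyType, CallableProxyType).
_WEAKREF_LIKE = (weakref.ReferenceType,) + tuple(weakref.ProxyTypes)


def is_weakref_like(obj):
    """True if obj is already a weak reference or a weak proxy."""
    return isinstance(obj, _WEAKREF_LIKE)


def maybe_weakref(obj):
    """Return weakref.ref(obj), or None if obj should not be wrapped.

    Hypothetical helper: refuses to wrap references and proxies, and
    also returns None for objects that cannot be weakly referenced.
    """
    if is_weakref_like(obj):
        return None
    try:
        return weakref.ref(obj)
    except TypeError:
        # e.g. int instances do not support weak references
        return None


class Plain:
    pass


class WithCall:
    def __call__(self):
        return "called"


plain = Plain()
with_call = WithCall()

print(is_weakref_like(weakref.proxy(plain)))      # a ProxyType instance
print(is_weakref_like(weakref.proxy(with_call)))  # a CallableProxyType instance
print(maybe_weakref(plain)() is plain)
```

Checking against the whole `weakref.ProxyTypes` tuple covers both proxy flavours, which is the point the commit message makes in its closing parenthetical.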
Aleksei Nikiforov	1069a08dac	Enable more nightly tests on s390x (#160893 ) Enable more nightly tests on s390x Pull Request resolved: https://github.com/pytorch/pytorch/pull/160893 Approved by: https://github.com/malfet	2025-08-28 22:20:55 +00:00
soulitzer	1190b7f73e	Support Triton kernels in SAC region (#161541 ) SAC interaction with Triton kernels: - In eager, Triton ops are not dispatchable, so SAC always ignores them, i.e., they are always recomputed. - In compile, although we wrap Triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager. - If you want to do something else (e.g. save the output of your Triton kernel), you should wrap it in a custom op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541 Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/xmfan	2025-08-28 21:15:25 +00:00
PyTorch MergeBot	f46e4bcf43	Revert "Add ciflow/vllm to vLLM commit hash update PR(s) (#161678 )" This reverts commit 0e358050304c6a350dae2bce497bd1867ecc3c9f. Reverted https://github.com/pytorch/pytorch/pull/161678 on behalf of https://github.com/yangw-dev due to we want to keep the vllm pin updated now, right now we have some failure ([comment](https://github.com/pytorch/pytorch/pull/161678#issuecomment-3234876332))	2025-08-28 20:42:19 +00:00
Ruben Rodriguez Buchillon	496052faf6	[inductor][decompose-k] make part of template heuristics (#161098 ) # why - enable it to go through the common template heuristics point - make it easier to use in common extension points, e.g. lookup table # what - break template heuristic into base + triton - move k_split generation logic into a template heuristic for decompose k - register through the normal mechanism - to make testing work, add a context manager to temporarily set template heuristics for a template/op to empty (effectively skipping it). This is used in the decompose k test to disable triton choices # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D80670918](https://our.internmc.facebook.com/intern/diff/D80670918) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161098 Approved by: https://github.com/jansel ghstack dependencies: #161026, #161097	2025-08-28 20:14:48 +00:00
Ruben Rodriguez Buchillon	f641effe19	[inductor][ez] move template heuristics into dir (#161097 ) # why - simplify the expansion of heuristics beyond just triton (e.g. decomposeK) # what - move template heuristics and registry into its own folder - adjust imports accordingly # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D80670917](https://our.internmc.facebook.com/intern/diff/D80670917) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161097 Approved by: https://github.com/PaulZhang12, https://github.com/jansel ghstack dependencies: #161026	2025-08-28 20:14:48 +00:00
Ruben Rodriguez Buchillon	688acf0b83	[inductor][mm] restructure decompose k (#161026 ) # why - make it easier to integrate into lookup table later # what - current version generates templates on the fly and uses them to generate a single choice - lookup table and performance model work best when there is a stable set of templates (with predictable names) and those are then parametrized - this change makes it so that there is a single DecomposeK template with a stable name, and the k split is the only parametrization we do # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v ``` Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026 Approved by: https://github.com/PaulZhang12, https://github.com/jansel	2025-08-28 20:14:41 +00:00
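For readers unfamiliar with the idea behind "decompose k", the sketch below shows the usual split-K formulation on plain nested lists: the K dimension is cut into `k_split` equal chunks, each chunk yields a partial product, and the partials are summed. This is an assumption about the general technique, written in pure Python for illustration; it is not the Inductor DecomposeK template, and the function names are hypothetical.

```python
def matmul(a, b):
    """Reference (M, K) x (K, N) product on nested lists."""
    n_cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(n_cols)]
        for row in a
    ]


def decompose_k_matmul(a, b, k_split):
    """Split K into k_split equal chunks, multiply chunk by chunk,
    and sum the partial products.

    k_split is the only parameter that varies, mirroring the commit's
    description of a single template parametrized by the k split.
    """
    k = len(b)
    if k % k_split != 0:
        raise ValueError("k_split must divide K evenly")
    chunk = k // k_split
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for s in range(k_split):
        lo, hi = s * chunk, (s + 1) * chunk
        a_part = [row[lo:hi] for row in a]   # (M, K / k_split)
        b_part = b[lo:hi]                    # (K / k_split, N)
        partial = matmul(a_part, b_part)
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
print(decompose_k_matmul(a, b, k_split=2) == matmul(a, b))
```

Because every valid `k_split` yields the same result, the split can be treated as a pure tuning knob, which fits the commit's goal of a stable template name with one parameter.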
Natalia Gimelshein	f0a517e333	Use vectorized stores for all dtypes (#161649 ) resurrecting #151818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649 Approved by: https://github.com/Skylion007	2025-08-28 20:06:29 +00:00
Kevin Fu	bacdd985a9	[PT2] Add fastResizeToZero to all static dispatch kernels (#161679 ) Summary: Add fastResizeToZero whenever we are reusing output tensors. Otherwise it keeps emitting the following warning: ``` Warning: An output with one or more elements was resized since it had shape [10], which does not match the required output shape [181]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check) ``` Test Plan: Run local replayer. ``` MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model MODEL_ENTITY_ID=786096203 SNAPSHOT_ID=11 HARDWARE_TYPE=1 ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 3443 2>&1 \| tee ~/logs/${MODEL_TYPE}/predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID} sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 1000 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_ads_mtml_offsite_cvr_oba_optout_dedicated_model_100 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} false 3443 false 2>&1 \| tee ~/logs/${MODEL_TYPE}/replayer_${MODEL_ENTITY_ID}_${SNAPSHOT_ID} ``` Before: P1921177565 After: P1921178087 Rollback Plan: Differential Revision: D81177596 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161679 Approved by: https://github.com/henryoier	2025-08-28 19:58:40 +00:00
RajeshvShiyal	1621b5494c	Removed redundant dtype conversion in scaled_dot_product_attention docstring example (#161613 ) Applies the change suggested in #161611: removes the line `attn_bias.to(query.dtype)` entirely. Fixes #161611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161613 Approved by: https://github.com/mikaylagawarecki	2025-08-28 19:58:07 +00:00
Avik Chaudhuri	69d91b94ba	kill allow_complex_guards_as_runtime_asserts (#160198 ) Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept). Test Plan: updated tests Rollback Plan: Differential Revision: D79903317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198 Approved by: https://github.com/ezyang	2025-08-28 19:36:19 +00:00
Dmitry Nikolaev	b76f6d117a	[ROCm] fix numpy version detection and adjust fudge_factors for MI355 (#161429 ) This PR fixes: - Numpy >= 2.1 version detection (instead of python 3.13 version detection) to skip some tests (numpy 2.1 can be installed for older python versions) ``` test_quantization.py::TestDynamicQuantizedOps::test_qlinear test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy test_quantization.py::TestQuantizedLinear::test_qlinear test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu test_quantization.py::TestQuantizedLinear::test_qlinear_relu test_quantization.py::TestQuantizedLinear::test_qlinear_tanh test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32 ``` - A couple of SDPA tests on MI355 by adjusting fudge_factors: ``` test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32 test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429 Approved by: https://github.com/jeffdaily	2025-08-28 19:32:09 +00:00
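The idea of gating test skips on the NumPy version rather than on the Python version can be sketched as follows. This is only an illustration under stated assumptions: `version_at_least` is a hypothetical helper, not the function used in the PyTorch test suite, and it takes the version string as an argument so that it runs without NumPy installed (in practice one would pass `numpy.__version__`).

```python
import re


def version_at_least(version: str, minimum: tuple) -> bool:
    """Return True if `version` (e.g. "2.1.0") is >= `minimum` (major, minor).

    Hypothetical helper: only the leading "major.minor" digits are compared,
    so suffixes such as "rc1" or ".dev0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# The skip decision depends on the installed library, not the interpreter:
# NumPy 2.1 can be installed on Python versions older than 3.13.
for candidate in ("1.26.4", "2.0.2", "2.1.0", "2.3.1"):
    print(candidate, version_at_least(candidate, (2, 1)))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "2.10" would sort before "2.9".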
Karthick Panner Selvam	130e50afff	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos, https://github.com/atalman	2025-08-28 18:57:34 +00:00
Shangdi Yu	30ab87c884	[inductor] don't append None to choices (#161672 ) Summary: don't append None as a choice to choices in autotune Test Plan: See internal Diff Differential Revision: D81188644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161672 Approved by: https://github.com/angelayi	2025-08-28 18:48:50 +00:00
PyTorch MergeBot	049c08eda8	Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934 )" This reverts commit 8f31aa97a3e1e17bed29b6cedf9884f0c6b145e9. Reverted https://github.com/pytorch/pytorch/pull/160934 on behalf of https://github.com/anijain2305 due to causes memory leak leading to OOMs ([comment](https://github.com/pytorch/pytorch/pull/160934#issuecomment-3234426359))	2025-08-28 17:56:36 +00:00
dolpm	affd071858	[export] serialization support for triton_kernel_wrapper_functional (#161314 ) Summary: att Test Plan: buck2 test mode/opt //caffe2/test:test_export -- test_triton_hop Rollback Plan: Differential Revision: D80827767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161314 Approved by: https://github.com/angelayi	2025-08-28 17:42:47 +00:00
angelayi	dac062f23b	Add aoti to mps benchmarks (#160741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160741 Approved by: https://github.com/malfet, https://github.com/huydhn	2025-08-28 17:32:29 +00:00
Wang, Chuanqi	2a70d98abf	[CI] Migrate XPU build and test to python 3.10 (#161708 ) Follow #161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708 Approved by: https://github.com/malfet	2025-08-28 17:27:11 +00:00
eqy	55c289d5c1	[cuBLASLt][FP8] `cuBLASLt` appears to support float8 rowwise-scaling on H100 (#161305 ) Following #157905 I think the macro around ``` TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt"); ``` was never updated and this would cause `float8` tests to fail. Also it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so removing that check here as well... CC @lw Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-28 17:04:25 +00:00
Nikita Shulga	2042d2174a	[MPS] Migrate round unary op to Metal (#161712 ) And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like ```python import torch print(torch.arange(-3., 3., step=.5, device='mps').round()) print(torch.arange(-3., 3., step=.5, device='mps').cpu().round()) ``` Before this change it printed ``` tensor([-3., -3., -2., -2., -1., -1., 0., 1., 1., 2., 2., 3.], device='mps:0') tensor([-3., -2., -2., -2., -1., -0., 0., 0., 1., 2., 2., 2.]) ``` But after this change results match Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712 Approved by: https://github.com/dcci	2025-08-28 16:45:07 +00:00
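The rounding difference described in the `torch.round` commit above can be reproduced without a GPU. The sketch below uses only the Python standard library and is not the Metal kernel from the PR: `round_half_away` is a hypothetical helper that imitates `std::round` (ties away from zero), while Python's built-in `round` breaks ties toward the even neighbour, the same rule `std::rint` follows in the default rounding mode.

```python
import math


def round_half_away(x: float) -> float:
    """Imitate C++ std::round: ties go away from zero (hypothetical helper)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def round_half_even(x: float) -> float:
    """Imitate std::rint in the default rounding mode: ties go to the even neighbour."""
    return float(round(x))


# Same inputs as torch.arange(-3., 3., step=.5) in the commit message.
values = [i * 0.5 for i in range(-6, 6)]
half_away = [round_half_away(v) for v in values]
half_even = [round_half_even(v) for v in values]

print(half_away)  # corresponds to the MPS output before the change
print(half_even)  # corresponds to the CPU output
```

The two lists differ only at the `.5` inputs, which is why the mismatch is easy to miss with inputs that contain no exact ties.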
Will Constable	4fd761fecc	[DTensor] Wrap sharding prop error with contextual exception (#161574 ) Mainly, this helps tell the user more info about the operator that failed to run if it fails during sharding propagation. Previously, only this exception would be raised: ``` RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.') ``` Now you get both the above exception as well as ``` The above exception was the direct cause of the following exception: RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2)) ``` <stacktrace omitted> <details><summary>detailed error</summary> ``` ====================================================================== ERROR: test_linear (__main__.TestDTensor) ---------------------------------------------------------------------- Traceback (most recent call last): File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper self._join_processes(fn) File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes self._check_return_codes(fn, elapsed_time) File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes raise RuntimeError(error) RuntimeError: Process 4 exited with error code 10 and exception: Traceback (most recent call last): File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch self.sharding_propagator.propagate(op_info) File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate OutputSharding, self.propagate_op_sharding(op_info.schema) File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__ return self.cache(args, kwargs) File 
"/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema) File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy input_tgt_placements, output_placements = propagate_shape_and_sharding( File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding in_dim = get_in_dim_to_shard(cmd) File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard raise RuntimeError( RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.') The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test getattr(self, test_name)() File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper fn() File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper method(args, *kwargs) File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper raise e File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper func(self, args, *kwargs) # type: ignore[misc] File "/data/users/whc/pytorch/test.py", line 60, in test_linear print("results: ", distributed_linear(distributed_input)) File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl return self._call_impl(args, *kwargs) File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl return forward_call(args, *kwargs) File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward return F.linear(input, self.weight, self.bias) File 
"/data/users/whc/pytorch/torch/_compile.py", line 53, in inner return disable_fn(args, *kwargs) File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn return fn(args, **kwargs) File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__ return DTensor._op_dispatcher.dispatch( File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch raise RuntimeError( RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2)) ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574 Approved by: https://github.com/zpcore, https://github.com/XilunWu	2025-08-28 15:56:15 +00:00
PyTorch MergeBot	a8270dd124	Revert "kill allow_complex_guards_as_runtime_asserts (#160198 )" This reverts commit 196232bb935cb346f143d5c39e9a73c44121a033. Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/atalman due to dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTestsCUDA::test_compile_selective_checkpoint_triton_kernel_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17289619543/job/49074475338) [HUD commit link](`196232bb93`) ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3234013520))	2025-08-28 15:40:37 +00:00
Jane Xu	63632fc7ee	Add new_zeros dtype variant to the shim and as a stable op (#161597 ) In case we want this before 2.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161597 Approved by: https://github.com/mikaylagawarecki	2025-08-28 13:57:24 +00:00
PyTorch MergeBot	05d0f11dbd	Revert "Add test coverage to tf32 in max autotune mm configs (#161545 )" This reverts commit e9d34b2438d65d6d16109e2416f3698de20f85c2. Reverted https://github.com/pytorch/pytorch/pull/161545 on behalf of https://github.com/atalman due to inductor/test_max_autotune.py::TestMaxAutotuneRemoteCache::test_get_mm_configs_float32_precision_ieee [GH job link](https://github.com/pytorch/pytorch/actions/runs/17283985553/job/49058214260) [HUD commit link](`e9d34b2438`) ([comment](https://github.com/pytorch/pytorch/pull/161545#issuecomment-3233569771))	2025-08-28 13:46:47 +00:00
PyTorch MergeBot	ef0483d74c	Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767 )" This reverts commit b36a20d368733740a8507b3109d193c88930323a. Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3233558168))	2025-08-28 13:44:41 +00:00
PyTorch MergeBot	5432966253	Revert "Remove test since it ooms on CI (#161644 )" This reverts commit 443452ca2f5beef58019f4e7e7e31c0526aee0fc. Reverted https://github.com/pytorch/pytorch/pull/161644 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/161644#issuecomment-3233550883))	2025-08-28 13:41:58 +00:00
PyTorch MergeBot	e9975f501c	Revert "Support Triton kernels in SAC region (#161541 )" This reverts commit 149c68071ca033d5e3427e63e05d9969bd4961e4. Reverted https://github.com/pytorch/pytorch/pull/161541 on behalf of https://github.com/malfet due to Broke some tests in trunk workflow, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=trunk%20%2F%20linux-jammy-cuda12.8 ([comment](https://github.com/pytorch/pytorch/pull/161541#issuecomment-3233457206))	2025-08-28 13:14:53 +00:00
xinan.lin	07f76517e7	[Inductor][Windows] Fix Windows test case failure. (#161497 ) Fixes Windows test case failures: - TritonCodeGenTests.test_inductor_sequence_nr - TritonCodeGenTests.test_indirect_device_assert - CompiledOptimizerTests.test_static_address_finalizer Pull Request resolved: https://github.com/pytorch/pytorch/pull/161497 Approved by: https://github.com/jansel	2025-08-28 12:40:42 +00:00
xinan.lin	3519969e4f	[Intel GPU] Enable tensor memory descriptor in triton template for XPU. (#161600 ) As Intel Triton now supports tensor descriptor, this PR updates the pinned Intel Triton version and introduces support for Triton MM template with tensor descriptor on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161600 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-28 12:39:58 +00:00
Tugsbayasgalan Manlaibaatar	5790b00975	[RELAND] Close some sources of fake tensor leakage (#161589 ) Reland of https://github.com/pytorch/pytorch/pull/159923 A couple of fixes: 1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability. 2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict. 3. We modify yolov3 to fix the previous silently incorrect behaviour. 4. We use strict export for levit_128 because it errors in non-strict due to stricter side effect checking. When upgrading the torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list. Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589 Approved by: https://github.com/avikchaudhuri	2025-08-28 09:46:42 +00:00
Eddie Yan	2e77a08b95	[cuDNN][TF32] Account for TF32 in `test_super_resolution_cuda` (#161662 ) cuDNN seems to be dispatching to TF32 kernels on B200 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161662 Approved by: https://github.com/Skylion007	2025-08-28 08:42:34 +00:00
Avik Chaudhuri	196232bb93	kill allow_complex_guards_as_runtime_asserts (#160198 ) Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was a export-only concept). Test Plan: updated tests Rollback Plan: Differential Revision: D79903317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198 Approved by: https://github.com/ezyang	2025-08-28 07:59:29 +00:00
PyTorch MergeBot	fa76256603	Revert "[dynamic shapes] use prims_common contiguity in create_example_tensors (#160933 )" This reverts commit 33c3794533844236a6e30ba377e0a6802b279fc8. Reverted https://github.com/pytorch/pytorch/pull/160933 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160933#issuecomment-3232305708))	2025-08-28 07:39:26 +00:00
Gabriel Ferns	d2d4a3c539	Select Algorithm clear feedback savers (#161654 ) Add `clear_feedback_savers` and tests for the feedback functionality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161654 Approved by: https://github.com/masnesral	2025-08-28 06:56:03 +00:00
Ke Wen	95516ad7e6	[4/N][SymmMem] Add `get_remote_tensor` + move up `get_buffer` and `get_signal_pad` (#161533 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): `get_remote_tensor`: return a symmetric tensor given a peer rank. The difference between the `get_buffer` API and the `get_remote_tensor` API: - the former accepts an offset, whereas the latter doesn't - the latter returns a symmetric tensor at `hdl.offset` on `peer`. As a refactoring, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level, as their code is common to all backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533 Approved by: https://github.com/ngimel ghstack dependencies: #161470, #161471, #161532	2025-08-28 06:47:35 +00:00
Ke Wen	ff9533970a	[3/N][SymmMem] Expose offset field from handle (#161532 ) As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532 Approved by: https://github.com/ngimel ghstack dependencies: #161470, #161471	2025-08-28 06:39:12 +00:00
Ke Wen	b291dc9684	[2/N][SymmMem] Add MemPool allocator and tests (#161471 ) (Porting most of #161008) Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory. To end users, this PR supports a python UI as follows: ``` allocator = symm_mem.get_mempool_allocator(device) mempool = torch.cuda.MemPool(allocator) with torch.cuda.use_mem_pool(mempool): tensor = torch.arange(numel, dtype=dtype, device=device) ``` Added tests for both use cases above. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471 Approved by: https://github.com/ngimel ghstack dependencies: #161470	2025-08-28 06:31:29 +00:00
Oguz Ulgen	0fd63fd88b	Guard config copy for pickle errors (#161659 ) Differential Revision: [D81168335](https://our.internmc.facebook.com/intern/diff/D81168335) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161659 Approved by: https://github.com/zou3519	2025-08-28 06:27:48 +00:00
Ke Wen	eec876deb6	[SymmMem] Isolate set_device tests to avoid hang (#161668 ) `test_symmetric_memory.py` hangs like this: ``` SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s] SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ... ``` This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`. However, such parameterization does not work well with `MultiProcContinuousTest` because the set device will "contaminate" the next test function. The solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which respawns processes every time. The hang is gone now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668 Approved by: https://github.com/fegin	2025-08-28 05:43:49 +00:00
Yang Wang	c83b43d7a8	[1/2]Add summary report for vllm build (#161565 ) Demo Run https://github.com/pytorch/pytorch/actions/runs/17259533323?pr=161565 <img width="1538" height="720" alt="image" src="https://github.com/user-attachments/assets/64f6d7b4-cac6-4c12-863c-b15514bb8810" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161565 Approved by: https://github.com/huydhn	2025-08-28 05:25:55 +00:00
Mikayla Gawarecki	d3d9eb4777	Error when TORCH_STABLE_ONLY is defined in TensorBase.h (#161658 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161658 Approved by: https://github.com/albanD	2025-08-28 04:36:31 +00:00
PyTorch UpdateBot	a65db6dc4c	[vllm hash update] update the pinned vllm hash (#161363 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161363 Approved by: https://github.com/pytorchbot	2025-08-28 04:14:19 +00:00
soulitzer	149c68071c	Support Triton kernels in SAC region (#161541 ) SAC interaction with triton kernel: - In eager, triton ops are not dispatchable, and so it is always ignored by SAC, i.e., always recomputed. - In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager. - If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541 Approved by: https://github.com/drisspg, https://github.com/zou3519 ghstack dependencies: #160781	2025-08-28 03:54:46 +00:00
xinan.lin	bae01479c3	[Inductor UT] Re-enable test_torchinductor_opinfo.py on XPU. (#161477 ) The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet. To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477 Approved by: https://github.com/jansel	2025-08-28 03:29:21 +00:00
cyy	8939d151d0	Use std::apply for CPU code (#152526 ) The supported compilers are recent enough to enable std::apply in C++17. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152526 Approved by: https://github.com/ezyang	2025-08-28 02:47:54 +00:00
rzou	5edc3d814f	Add option for TorchDispatchMode to ignore torch.compile internals (#161648 ) If TorchDispatchMode.ignore_compile_internals() is True, then we turn off the TorchDispatchMode during the compilation process, instead turning it back on during runtime of the compiled artifact. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648 Approved by: https://github.com/bdhirsh	2025-08-28 02:41:33 +00:00
rzou	199c3633bf	Fix Inductor Periodic (#161617 ) Models are now passing accuracy. # of graph breaks is larger because these were not actually tested in CI (if the model fails accuracy we do not assert on # of graph breaks). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161617 Approved by: https://github.com/anijain2305	2025-08-28 02:36:08 +00:00
Gabriel Ferns	e9d34b2438	Add test coverage to tf32 in max autotune mm configs (#161545 ) Add a test to make sure that the configs are using the correct setting of tf32 to prevent regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161545 Approved by: https://github.com/coconutruben	2025-08-28 02:27:58 +00:00
Simon Fan	be1612201d	[export] Support AC HOP in pre-dispatch (#161479 ) Adds the pre-dispatch handling for the AC HOP. This lets the HOP pre-dispatch export without actually pre-dispatch tracing into it. However, this is not sufficient to support AC in export: - the HOP body will still be in torch IR, so it will fail export verifiers - the exported module also can't be run in eager because the AC HOP relies on the partitioner to embed RNG state saving/restoring. So it must be lowered by AOT Autograd into post-dispatch before being executed. It suffices for my purposes though. If users used the checkpoint API in their exported model, the behavior goes from silently incorrect to a validation error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161479 Approved by: https://github.com/ydwu4 ghstack dependencies: #161353	2025-08-28 01:46:25 +00:00
Simon Fan	15670f9075	[dtensor] support local_map as a decorator (#161353 ) And extract it out as a convenience function for dynamo to wrap Pull Request resolved: https://github.com/pytorch/pytorch/pull/161353 Approved by: https://github.com/zpcore	2025-08-28 01:46:25 +00:00
Huy Do	0e35805030	Add ciflow/vllm to vLLM commit hash update PR(s) (#161678 ) As it should be, otherwise, PR(s) like https://github.com/pytorch/pytorch/pull/161121 were merged without the signals it needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161678 Approved by: https://github.com/atalman	2025-08-28 01:35:04 +00:00
Shangdi Yu	92c2daebb6	Add inductor provenance tracking artifacts to cache (#161440 ) Summary: - Add inductor provenance tracking artifacts to cache - Update the tlparse version pin to `0.4.0`. The old tlparse version errors out on the new tlparse output. The lowest tlparse version that works is `0.3.42`. tlparse error: ``` thread 'main' panicked at src/parsers.rs:671:71: called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0) stack backtrace: 0: 0x55e4ff1c7f00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h6d42cc84fc840290 1: 0x55e4ff1ee503 - core::fmt::write::h5af61a909e3ec64d 2: 0x55e4ff1c4c33 - std::io::Write::write_fmt::h5a7b54aa6e4a315d 3: 0x55e4ff1c7d52 - std::sys::backtrace::BacktraceLock::print::h555579e7396c26ac 4: 0x55e4ff1c8caf - std::panicking::default_hook::{{closure}}::h9128866118196224 5: 0x55e4ff1c8b1a - std::panicking::default_hook::h52e9e7314e0255f6 6: 0x55e4ff1c9652 - std::panicking::rust_panic_with_hook::h541791bcc774ef34 7: 0x55e4ff1c93fa - std::panicking::begin_panic_handler::{{closure}}::h6479a2f0137c7d19 8: 0x55e4ff1c8419 - std::sys::backtrace::__rust_end_short_backtrace::ha04e7c0fc61ded91 9: 0x55e4ff1c908d - rust_begin_unwind 10: 0x55e4fef7a030 - core::panicking::panic_fmt::h5764ee7030b7a73d 11: 0x55e4fef7a406 - core::result::unwrap_failed::h3ff7104a9ace307a 12: 0x55e4fefb3c56 - <tlparse::parsers::ArtifactParser as tlparse::parsers::StructuredLogParser>::parse::h20bc51a17ffc494a 13: 0x55e4fef9669a - tlparse::run_parser::h20c7729f151eec62 14: 0x55e4fef99a1b - tlparse::parse_path::he4892147f47fbade 15: 0x55e4fef7c760 - tlparse::main::hdc05613b32f4f53b 16: 0x55e4fef89263 - std::sys::backtrace::__rust_begin_short_backtrace::h15f188f3edf42596 17: 0x55e4fef8827d - std::rt::lang_start::{{closure}}::he2c21e32a442538e 18: 0x55e4ff1be0f0 - std::rt::lang_start_internal::h15895544e2012228 19: 0x55e4fef83975 - main 20: 0x7f0b3662a610 - __libc_start_call_main 21: 
0x7f0b3662a6c0 - __libc_start_main_alias_2 22: 0x55e4fef7a610 - <unknown> 23: 0x0 - <unknown> ``` Test Plan: ``` buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_kernel_information_generation python test/dynamo/test_structured_trace.py -k test_chromium_event ``` Differential Revision: D80976585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161440 Approved by: https://github.com/oulgen	2025-08-28 01:16:02 +00:00
Paul de Supinski	768a1017c5	Allow parallel start NUMA binding (#161576 ) # Context In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`. However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff). The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start: > Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted. But on further reading, the Linux docs say [`sched_setaffinity` is per thread.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is a misnomer. I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07) The upshot is that we actually can safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread. # This PR Remove restrictions against parallel start for NUMA binding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576 Approved by: https://github.com/d4l3k	2025-08-28 01:15:58 +00:00
Lakshay Garg	0c4a79b7e0	Replace some calls to new with make_{unique,shared} (#160581 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160581 Approved by: https://github.com/malfet	2025-08-28 00:30:45 +00:00
Son Nguyen	9b02435e9f	Improve Scheduler init duration (#161491 ) Early exit merge_loops() if config.loop_ordering_after_fusion is false. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161491 Approved by: https://github.com/jansel	2025-08-28 00:27:51 +00:00
Will Constable	fd60117051	[C10D] add _summarize_ranks util (#160284 ) Prints ranges of ranks succinctly. e.g. For a strided list of ranks, summarizes down to start:stop:step ``` 0:4096:512 ``` Omits step if it's 1 ``` 0:8 ``` Note: endpoints are exclusive. This may not be intuitive to everyone, but in the first example above the last rank is 3584, and in the second it is 7. Currently, does not support combinations of striding _and_ range (e.g. it cannot generate a representation like "0:2, 4:6, ..., 12:14"). Is this needed / useful? If so it could be added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284 Approved by: https://github.com/XilunWu	2025-08-28 00:17:53 +00:00
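The `start:stop:step` notation described in the `_summarize_ranks` commit above can be illustrated with a small stand-alone sketch. This is an assumption-laden illustration, not the code added in the PR: the function name `summarize_ranks`, its handling of empty or single-element inputs, and the comma-separated fallback for non-strided lists are all hypothetical choices made here.

```python
def summarize_ranks(ranks) -> str:
    """Summarise an evenly strided collection of ranks as start:stop[:step].

    Hypothetical sketch: the stop value is exclusive, as in Python slices,
    and the step is omitted when it equals 1.
    """
    ordered = sorted(set(ranks))
    if not ordered:
        return ""
    if len(ordered) == 1:
        return str(ordered[0])

    step = ordered[1] - ordered[0]
    evenly_strided = all(b - a == step for a, b in zip(ordered, ordered[1:]))
    if not evenly_strided:
        # Fallback chosen for this sketch only: list the ranks verbatim.
        return ", ".join(str(r) for r in ordered)

    stop = ordered[-1] + step  # exclusive endpoint
    if step == 1:
        return f"{ordered[0]}:{stop}"
    return f"{ordered[0]}:{stop}:{step}"


print(summarize_ranks(range(0, 4096, 512)))  # last rank is 3584
print(summarize_ranks(range(8)))             # last rank is 7
```

Keeping the endpoint exclusive means the summary string can be pasted straight into `range(...)` or a slice to recover the original ranks.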
Pian Pawakapan	97a548b640	[PGO] skip allowlist logging for empty graphs (#161530 ) Summary: reduces spurious logging Test Plan: test_pgo Rollback Plan: Differential Revision: D81060182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161530 Approved by: https://github.com/bobrenjc93, https://github.com/mlazos	2025-08-28 00:12:13 +00:00
PyTorch MergeBot	c55bdb26e1	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit 378edb047f83dfb84c2d9c032bddebc5e0147b8f. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))	2025-08-27 23:45:12 +00:00
PyTorch MergeBot	903181bb6f	Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471 )" This reverts commit 4ed71d5412d58746d23f16689cab61da0e8149ef. Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3230069186))	2025-08-27 23:18:36 +00:00
David Berard	ba201082b6	[TorchScript] ProfilingExecutor - RemoveProfileNodesAndSpecializeTypes None handling (#161538 ) ProfilingGraphExecutor works like this: 1. do some unrelated JIT optimizations 2. Add profiling nodes to collect JIT information like tensor dtypes and shapes 3. Do some more unrelated JIT optimizations 4. Remove the profiling nodes and extract the tensor info, and then use the JIT tensor info to do optimizations. This PR is intended to fix a bug in Step 4, where the profiling nodes were removed. It was previously assumed that all the things that were profiled were either Tensors or Optional[Tensor]s - otherwise, step 2 would not have introduced a profiling node. However, we saw a case where step 3 would replace Optional[Tensor] inputs with `None` inputs (e.g. if a conditional that returned a Tensor or a None could be statically known to only follow the `None` branch). To fix this, we essentially just modify the RemoveProfileNodesAndSpecializeTypes assert so that it accepts Tensors, Optional[Tensor]s, or None (the new part). Note that this issue is probably somewhat uncommon (maybe why we didn't see it for the first 4 years that this code existed). I expect that, typically, any time that step 3 would convert `Optional[Tensor] -> None`, step 1 would have already done that. So it's difficult to reproduce in an end-to-end TorchScript workload. Differential Revision: [D81068172](https://our.internmc.facebook.com/intern/diff/D81068172) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161538 Approved by: https://github.com/nmacchioni	2025-08-27 23:12:15 +00:00
PyTorch MergeBot	8fc2467fe5	Revert "[3/N][SymmMem] Expose offset field from handle (#161532 )" This reverts commit 68d395d61e9d4601ab1e2bca56eb28253572c662. Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/161471 internal failure ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3230016806))	2025-08-27 23:06:55 +00:00
drisspg	30edac5da6	Updates to CuTe DSL template renderer (#161117 ) # Summary This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack. <img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" /> The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack. A bunch of score mods that me and Claude came up with, that exercise the actual ops. ``` Py def causal_mask(score, b, h, q_idx, kv_idx): """Causal attention mask.""" return torch.where(q_idx >= kv_idx, score, float("-inf")) def relative_bias(score, b, h, token_q, token_kv): """Relative position bias.""" return score + torch.abs(token_q - token_kv) def relative_bias_v2(score, b, h, token_q, token_kv): """Relative position bias with factor of 2.""" return score + 2 * torch.abs(token_q - token_kv) def times_two(score, b, h, q_idx, kv_idx): """Simple score modification that doubles the score.""" return score * 2 def alibi_bias(score, b, h, q_idx, kv_idx): """ALiBi (Attention with Linear Biases) - used in some modern models.""" # Different slopes for different heads slope = 2 ** (-8 * (h + 1) / 8) # Simplified version return score - slope * torch.abs(q_idx - kv_idx) def sliding_window(score, b, h, q_idx, kv_idx, window_size=256): """Sliding window attention - only attend to nearby tokens.""" return torch.where( torch.abs(q_idx - kv_idx) <= window_size, score, float("-inf") ) def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64): """Block diagonal attention pattern.""" q_block = q_idx // block_size kv_block = kv_idx // block_size return torch.where(q_block == kv_block, score, float("-inf")) def additive_bias(score, b, h, q_idx, kv_idx): """Test simple 
addition with position-based bias.""" return score + (q_idx + kv_idx) * 0.01 def multiplicative_decay(score, b, h, q_idx, kv_idx): """Test multiplication with distance-based decay.""" distance = torch.abs(q_idx - kv_idx) return score * torch.exp(-0.1 * distance) def sine_wave_bias(score, b, h, q_idx, kv_idx): """Test trigonometric functions.""" return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64) def log_distance_penalty(score, b, h, q_idx, kv_idx): """Test logarithmic operations.""" distance = torch.abs(q_idx - kv_idx).float() return score - torch.log(1 + distance) def alternating_mask(score, b, h, q_idx, kv_idx): """Test with alternating pattern - good for branch prediction.""" return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf")) def head_specific_pattern(score, b, h, q_idx, kv_idx): """Different behavior per attention head.""" even_head = h % 2 == 0 causal = q_idx >= kv_idx return torch.where(even_head & causal, score, float("-inf")) def sparse_strided(score, b, h, q_idx, kv_idx, stride=4): """Sparse attention with strided pattern.""" return torch.where( (kv_idx % stride == 0) \| (q_idx == kv_idx), score, float("-inf") ) def causal_with_global(score, b, h, q_idx, kv_idx): """Causal mask but first few tokens are globally attended.""" is_causal = q_idx >= kv_idx is_global = kv_idx < 4 return torch.where(is_causal \| is_global, score, float("-inf")) def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2): """Dilated attention pattern - exponentially increasing gaps.""" distance = torch.abs(q_idx - kv_idx) is_attended = (distance == 0) \| ((distance > 0) & ((distance & (distance - 1)) == 0)) return torch.where(is_attended, score, float("-inf")) ``` Example outputs: ``` [Test Suite] Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128 [Test 1: none] [No score_mod, flash='enabled'] Found flash_attncute: True [No score_mod, flash='disabled'] Found flash_attncute: False ✓ Outputs match between flash enabled/disabled ✓ 
Output matches eager SDPA (rtol=0.001, atol=0.001) [Test 2: causal] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 17879 / 134217728 (0.0%) Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed) Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed) [Test 3: rel_bias] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 12836 / 134217728 (0.0%) Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed) Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed) [Test 4: rel_bias_v2] ``` This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive but it is fairly covering Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117 Approved by: https://github.com/mlazos	2025-08-27 23:01:31 +00:00
Avik Chaudhuri	12c0cf3fab	switch prefer_deferred_runtime_asserts_over_guards in export (#160111 ) Summary: In preparation for checking shape guards in export, this PR effectively switches `prefer_deferred_runtime_asserts_over_guards` to `False`, matching Dynamo. Actually that's a lie: we switch it to `allow_complex_guards_as_runtime_asserts`, which is `False` by default but can be controlled via an internal API to be `True`. This makes the two flags synchronized, so we should be able to kill `allow_complex_guards_as_runtime_asserts` at this point. Test Plan: updated tests Rollback Plan: Differential Revision: D79734206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160111 Approved by: https://github.com/tugsbayasgalan	2025-08-27 22:51:10 +00:00
Zain Rizvi	6b051d7de3	[BE] Refactor trymerge for readability (#161637 ) Two changes: - Extract getting the last_commit's sha into its own function - Rename merge_changes to merge_changes_locally to better explain its functionality Pull Request resolved: https://github.com/pytorch/pytorch/pull/161637 Approved by: https://github.com/seemethere, https://github.com/malfet ghstack dependencies: #161558	2025-08-27 22:44:00 +00:00
rebeccajae	ee0ec21191	Ensure that tensors are contiguous before using no-graph MPS impl (#161641 ) Fixes #161640 Check if tensors are contiguous before using the no-graph implementation. Using the script in the issue above with this change I get expected results. ``` MPS contiguous result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0') MPS non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0') CPU non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-27 22:31:57 +00:00
Xinran / Allan Rui	7da02bf8af	Skip const folding with symbolic expression (#161437 ) Summary: When performing constant folding, we must skip over operators that have symbolic `fill_value`. Test Plan: CI Rollback Plan: Reviewed By: kalpit-meta-1 Differential Revision: D80965936 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161437 Approved by: https://github.com/StellarrZ	2025-08-27 22:09:58 +00:00
William Wen	1041805c1e	[dynamo, nested graph breaks] prevent excessive recompilations (#159786 ) Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object. Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break. Followup: we can skip guards on continuation functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786 Approved by: https://github.com/anijain2305 ghstack dependencies: #159329, #159678, #159817, #160138	2025-08-27 21:53:37 +00:00
William Wen	6562646dab	[dynamo, nested graph breaks] clean up comments and codegen (#160138 ) Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures. Also simplify some codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138 Approved by: https://github.com/anijain2305 ghstack dependencies: #159329, #159678, #159817	2025-08-27 21:53:37 +00:00
William Wen	d0a242e547	[dynamo, nested graph breaks] support nested closures (#159817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817 Approved by: https://github.com/anijain2305 ghstack dependencies: #159329, #159678	2025-08-27 21:53:37 +00:00
William Wen	3f8090809f	[dynamo, nested graph breaks] support nested graph breaks x context managers (#159678 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678 Approved by: https://github.com/anijain2305 ghstack dependencies: #159329	2025-08-27 21:53:37 +00:00
William Wen	10d93325b1	[dynamo, nested graph breaks] support very simple nested graph breaks (#159329 ) e.g. this graph breaks once now: ```python import torch torch._dynamo.config.nested_graph_breaks = True def inner(x): x = x + 1 torch._dynamo.graph_break() return x + 2 @torch.compile(backend="eager") def outer(x): return inner(x) print(outer(torch.ones(3))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329 Approved by: https://github.com/anijain2305	2025-08-27 21:53:37 +00:00
Animesh Jain	68fa882dad	[dynamo] Correctly track mutation class source for MutableMappingVariable (#161568 ) Fixes https://github.com/pytorch/pytorch/issues/161505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161568 Approved by: https://github.com/Lucaskabela, https://github.com/malfet	2025-08-27 21:47:17 +00:00
Yu, Guangye	b9c6aa1e17	Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 )" (#161628 ) This reverts commit ae1a706444d6c0a6019ffc936c8b36574335a5d5. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161628 Approved by: https://github.com/atalman ghstack dependencies: #161625, #161626, #161627	2025-08-27 21:37:14 +00:00
Yu, Guangye	b7b9fb9962	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#161627 ) This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627 Approved by: https://github.com/atalman ghstack dependencies: #161625, #161626	2025-08-27 21:37:14 +00:00
Yu, Guangye	c03d8d4082	Revert "Generalize torch._C._set_allocator_settings to be generic (#156175 )" (#161626 ) This reverts commit 908c5cc4c0f22d141776bde47c296b5186691855. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626 Approved by: https://github.com/atalman ghstack dependencies: #161625	2025-08-27 21:37:14 +00:00
clr	40f46b09c7	async_compile: Fix the wait method to actually wait (#161561 ) This method never triggered. It's used in 2 tests and they pass, so no serious concern. Note that I did introduce and fix a latent bug: if we called shutdown_compile_workers, jobs would crash with this change due to ready_future being finished if we called wait. However, we only call wait in tests, so that bug is fine. The other behaviour is that if you called shutdown, I believe we could block on your first triton compile after that, until the pool was ready. This should now correctly switch to direct mode until the pool is ready on later warmups. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161561 Approved by: https://github.com/masnesral ghstack dependencies: #161452	2025-08-27 21:35:31 +00:00
clr	0d6597138c	inductor: Log the specific triton kernel that fails (#161452 ) Added an optional name argument to SubprocPool.submit. We record this in a dictionary, and when raising exceptions, add the name. We manage the lifecycle the same as the pending futures. Added a specific testcase to make sure this logs correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161452 Approved by: https://github.com/masnesral	2025-08-27 21:35:31 +00:00
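The idea in the entry above (#161452), recording a name per submitted job and attaching it to any exception, can be sketched with the standard library. This is only an illustration under stated assumptions, not Inductor's `SubprocPool`: the class `NamedPool`, its `_names` dictionary, and the error message format are all hypothetical, and a thread pool stands in for the subprocess pool.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class NamedPool:
    """Hypothetical stand-in for a worker pool that labels failures by job name."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # Maps each pending inner future to its optional job name.
        self._names = {}

    def submit(self, fn, *args, name=None):
        inner = self._pool.submit(fn, *args)
        outer = Future()
        if name is not None:
            self._names[inner] = name

        def _done(finished):
            # The name entry lives exactly as long as the pending future.
            job_name = self._names.pop(finished, None)
            exc = finished.exception()
            if exc is None:
                outer.set_result(finished.result())
            elif job_name is not None:
                wrapped = RuntimeError(f"job {job_name!r} failed: {exc}")
                wrapped.__cause__ = exc
                outer.set_exception(wrapped)
            else:
                outer.set_exception(exc)

        inner.add_done_callback(_done)
        return outer

    def shutdown(self):
        self._pool.shutdown(wait=True)


def failing_job():
    raise ValueError("bad kernel")


pool = NamedPool()
named_future = pool.submit(failing_job, name="triton_kernel_0")
try:
    named_future.result(timeout=5)
except RuntimeError as err:
    error_text = str(err)
    print(error_text)
pool.shutdown()
```

Removing the name inside the completion callback mirrors the commit's note that the name's lifecycle is managed the same way as the pending futures.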
Yu, Guangye	06ddaf1e0a	Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#160999 )" (#161625 ) This reverts commit a818fa77e3a72271f144514ef349c5a666313205. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625 Approved by: https://github.com/atalman	2025-08-27 21:34:12 +00:00
Blaine Burton Rister	26d0ff1cba	[AOTI-FX] Enhance launch grid FloorDiv replacement using sympy.together. (#161582 ) # Feature 2d launch grids with dynamic shapes can contain sympy expressions like `floor(x / 128 + y / 128)`. This breaks the dynamic shapes tracer which only supports `FloorDiv`, and not `floor`. To handle this case, call `sympy.together` prior to pattern matching to convert this to `floor((x + y) / 128)`. Then, we can recognize the pattern and map it to `FloorDiv(x + y, 128)`. # Test plan Added a custom Triton test exposing this. The test calls a 2d autotuned kernel with dynamic shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161582 Approved by: https://github.com/nandesuka	2025-08-27 21:31:28 +00:00
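The rewrite described in the entry above (#161582) can be illustrated with plain SymPy. This is a minimal sketch, not the AOTI-FX code: the helper name `split_floor` is hypothetical, and it relies only on the standard `sympy.together` and `sympy.fraction` functions to turn the argument of `floor` into a single numerator and denominator.

```python
import sympy

x, y = sympy.symbols("x y")


def split_floor(expr):
    """Hypothetical helper: combine the floor argument into one fraction."""
    assert isinstance(expr, sympy.floor)
    # together() merges x/128 + y/128 into a single quotient.
    combined = sympy.together(expr.args[0])
    # fraction() splits that quotient into (numerator, denominator).
    return sympy.fraction(combined)


grid = sympy.floor(x / 128 + y / 128)
numer, denom = split_floor(grid)
print(numer, denom)

# With concrete sizes, integer floor division reproduces the original floor.
sizes = {x: 100, y: 200}
floor_value = int(grid.subs(sizes))
floordiv_value = int(numer.subs(sizes)) // int(denom)
print(floor_value, floordiv_value)
```

Once the numerator and denominator are separated, a tracer that only understands floor division can represent the grid as `FloorDiv(numer, denom)`, which is the mapping the commit message describes.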
zhxchen17	c36d18d7e8	[rfc] aot precompile with custom backend api (#161383 ) Adding a new feature to torch.compile(fullgraph=True) which "aot_compile" a function with given example inputs. On the user side it should look like: ``` def foo(x, y): return x + y compiled_fn = torch.compile(fullgraph=True).aot_compile(((torch.randn(3, 4), torch.randn(3, 4)), {})) ``` This is different from the traditional `torch.compile` workflow where compiled object will be a drop-in replacement for the original eager model: ``` tensor input -> torch.compile() -> tensor output (and populates the cache entry) ``` `aot_compile` will instead return a compiled function as result, and it's purely functional and doesn't populate the compile cache entry in dynamo: ``` tensor input -> aot_compile() -> compiled function ``` The aot compiled function will be savable and loadable on disk as well: ``` torch.compile(fullgraph=True).aot_compile(...).save_compiled_function('my/path') compiled_fn = torch.compiler.load_compiled_function("my/path") ``` Right now we treat compiler backend as a blackbox and it needs to implement the following interface to make compile artifacts serializable: ``` class SerializableCallable: def save_compile_artifacts(): .... def load_compile_artifacts(): .... ``` We haven't implemented this for inductor yet, but this shouldn't be an issue since we gate this feature through `torch._dynamo.config.aot_compile` (which defaults to False), and this will be left as follow up PR to the current PR. Differential Revision: [D80914270](https://our.internmc.facebook.com/intern/diff/D80914270/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161383 Approved by: https://github.com/tugsbayasgalan	2025-08-27 21:26:25 +00:00
PyTorch MergeBot	014b98dd09	Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 )" This reverts commit 77bc959fe122bfd131e339ca36cab445a1860806. Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is failing internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))	2025-08-27 21:21:19 +00:00
PyTorch MergeBot	38ed57d446	Revert "Updates to CuTe DSL template renderer (#161117 )" This reverts commit 1750cc80374a9dd22fc26701c0602ae11a62baf0. Reverted https://github.com/pytorch/pytorch/pull/161117 on behalf of https://github.com/atalman due to will need to revert to unblock revert of https://github.com/pytorch/pytorch/pull/151314 ([comment](https://github.com/pytorch/pytorch/pull/161117#issuecomment-3229754295))	2025-08-27 21:17:25 +00:00
Benjamin Glass	007935a802	[cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063 Approved by: https://github.com/Skylion007 ghstack dependencies: #160754	2025-08-27 21:15:01 +00:00
Benjamin Glass	cbc53b7696	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-27 21:15:01 +00:00
Zain Rizvi	624bc36163	Ensure the comment id is always passed in to trymerge (#161558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161558 Approved by: https://github.com/seemethere, https://github.com/malfet	2025-08-27 19:53:28 +00:00
Wang, Chuanqi	06c7516994	[BE] Upgrade XPU support package to 2025.2 (#158733 ) Including below changes, - Add XPU support package 2025.2 build and test in CI for both Linux and Windows - Keep XPU support package 2025.1 build in CI to ensure no break issue until PyTorch 2.9 release - Upgrade XPU support package from 2025.1 to 2025.2 in CD for both Linux and Windows - Rename Linux CI job name & image name to n & n-1 - Update XPU runtime pypi packages dependencies of CD wheels - Remove deprecated support package version docker image build Pull Request resolved: https://github.com/pytorch/pytorch/pull/158733 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-27 19:33:38 +00:00
William Wen	2efcf9d081	[dynamo] Fix graph break registry loading in fbcode (#161550 ) Summary: Add `torch/_dynamo/graph_break_registry.json` as an internal dependency. Minor related fixes. Test Plan: Test on OSS. Rollback Plan: Differential Revision: D81078973 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161550 Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305	2025-08-27 19:25:15 +00:00
drisspg	443452ca2f	Remove test since it ooms on CI (#161644 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161644 Approved by: https://github.com/BoyuanFeng	2025-08-27 19:11:29 +00:00
Roman Bobniev	47ecd2042f	[ONNX] Fix index_put_ usage (#161263 ) Summary: It's hard to understand how this works in most of our models, but in general it looks like `aten::copy_` is replaced incorrectly. There are two schemas for `aten::copy_`: 1. `aten::copy_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!)` 2. `aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)` According to the logic in the comments we don't need one of the parameters for `aten::index_put_`. It seems the logic was inferred from ordinary `aten::copy`, where there can be a third parameter, the `non_blocking` flag. Depending on the execution environment, the sliced copy can be replaced either by the first schema or by the second schema with the default parameter explicitly set to `False`. If the first schema is selected, it leads to a crash (which is easy to catch in our prod env). If the second schema is selected, there is no crash, but the third parameter is treated as the `accumulate` parameter of the `index_put_` function, which doesn't make sense. So, in either case, use of the third parameter must be removed from the `aten::copy_` replacement. For more details, see this post: https://fb.workplace.com/groups/1405155842844877/permalink/25337687649165028/ Test Plan: The test fails in the production environment only. In the test env the `non_blocking` flag is mapped as `False` to the `accumulate` flag, which doesn't cause the test to fail but makes no sense in terms of flag mapping. 
The export works without errors, before the fix it was failing with accessing by index out of bounds vector, like this: ``` 1095 _C._jit_onnx_log("Torch IR graph at exception: ", graph) File ~/.bento/kernels/bento_kernel_gaia_ml/1578/bento_kernel_gaia_ml_binary-inplace#link-tree/torch/onnx/utils.py:636, in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module) 629 _C._jit_pass_lower_all_tuples(graph) 630 # in _jit_pass_onnx, symbolic functions are called for each node for conversion. 631 # However, there are nodes that cannot be converted without additional context. 632 # For example, the number of outputs from split (and whether it is static or dynamic) is unknown 633 # until the point where it is unpacked by listUnpack node. 634 # This pass does a preprocess, and prepares the nodes such that enough context can be received 635 # by the symbolic function. --> 636 _C._jit_pass_onnx_remove_inplace_ops_for_onnx(graph, module) 637 _C._jit_pass_onnx_preprocess(graph) 639 # onnx does not support tuples, so try to remove them RuntimeError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2) ``` The test script: ``` import torch as th import tempfile class CopyTest(th.nn.Module): def forward( self, input_th: th.Tensor ): to_fill = th.ones((3, 3)) to_fill[:, 0] = input_th[:, 0] return to_fill m = CopyTest() test_tensor = th.zeros((3, 3)) with tempfile.NamedTemporaryFile() as f: th.onnx.export( m, (test_tensor,), f, export_params=True, opset_version=17, do_constant_folding=True, input_names=["input"], output_names=["features"], dynamo=False, ) ``` The exported model test: ``` import torch import onnx import onnxruntime model_name = '/home/ironsided/test_model.onnx' onnx_model = onnx.load(model_name) onnx.checker.check_model(onnx_model) example_inputs = (torch.zeros(3, 3),) onnx_inputs = [tensor.numpy(force=True) for tensor in example_inputs] print(f"Input length: 
{len(onnx_inputs)}") print(f"Sample input: {onnx_inputs}") ort_session = onnxruntime.InferenceSession( model_name, providers=["CPUExecutionProvider"] ) onnxruntime_input = {input_arg.name: input_value for input_arg, input_value in zip(ort_session.get_inputs(), onnx_inputs)} # ONNX Runtime returns a list of outputs onnxruntime_outputs = ort_session.run(None, onnxruntime_input)[0] print(onnxruntime_outputs) ``` The produced result is correct: ``` Input length: 1 Sample input: [array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]], dtype=float32)] [[0. 1. 1.] [0. 1. 1.] [0. 1. 1.]] ``` Rollback Plan: Differential Revision: D80797028 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161263 Approved by: https://github.com/justinchuby, https://github.com/jermenkoo	2025-08-27 18:53:13 +00:00
drisspg	1750cc8037	Updates to CuTe DSL template renderer (#161117 ) # Summary This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack. <img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" /> The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack. A bunch of score mods that me and Claude came up with, that exercise the actual ops. ``` Py def causal_mask(score, b, h, q_idx, kv_idx): """Causal attention mask.""" return torch.where(q_idx >= kv_idx, score, float("-inf")) def relative_bias(score, b, h, token_q, token_kv): """Relative position bias.""" return score + torch.abs(token_q - token_kv) def relative_bias_v2(score, b, h, token_q, token_kv): """Relative position bias with factor of 2.""" return score + 2 * torch.abs(token_q - token_kv) def times_two(score, b, h, q_idx, kv_idx): """Simple score modification that doubles the score.""" return score * 2 def alibi_bias(score, b, h, q_idx, kv_idx): """ALiBi (Attention with Linear Biases) - used in some modern models.""" # Different slopes for different heads slope = 2 ** (-8 * (h + 1) / 8) # Simplified version return score - slope * torch.abs(q_idx - kv_idx) def sliding_window(score, b, h, q_idx, kv_idx, window_size=256): """Sliding window attention - only attend to nearby tokens.""" return torch.where( torch.abs(q_idx - kv_idx) <= window_size, score, float("-inf") ) def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64): """Block diagonal attention pattern.""" q_block = q_idx // block_size kv_block = kv_idx // block_size return torch.where(q_block == kv_block, score, float("-inf")) def additive_bias(score, b, h, q_idx, kv_idx): """Test simple 
addition with position-based bias.""" return score + (q_idx + kv_idx) * 0.01 def multiplicative_decay(score, b, h, q_idx, kv_idx): """Test multiplication with distance-based decay.""" distance = torch.abs(q_idx - kv_idx) return score * torch.exp(-0.1 * distance) def sine_wave_bias(score, b, h, q_idx, kv_idx): """Test trigonometric functions.""" return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64) def log_distance_penalty(score, b, h, q_idx, kv_idx): """Test logarithmic operations.""" distance = torch.abs(q_idx - kv_idx).float() return score - torch.log(1 + distance) def alternating_mask(score, b, h, q_idx, kv_idx): """Test with alternating pattern - good for branch prediction.""" return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf")) def head_specific_pattern(score, b, h, q_idx, kv_idx): """Different behavior per attention head.""" even_head = h % 2 == 0 causal = q_idx >= kv_idx return torch.where(even_head & causal, score, float("-inf")) def sparse_strided(score, b, h, q_idx, kv_idx, stride=4): """Sparse attention with strided pattern.""" return torch.where( (kv_idx % stride == 0) \| (q_idx == kv_idx), score, float("-inf") ) def causal_with_global(score, b, h, q_idx, kv_idx): """Causal mask but first few tokens are globally attended.""" is_causal = q_idx >= kv_idx is_global = kv_idx < 4 return torch.where(is_causal \| is_global, score, float("-inf")) def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2): """Dilated attention pattern - exponentially increasing gaps.""" distance = torch.abs(q_idx - kv_idx) is_attended = (distance == 0) \| ((distance > 0) & ((distance & (distance - 1)) == 0)) return torch.where(is_attended, score, float("-inf")) ``` Example outputs: ``` [Test Suite] Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128 [Test 1: none] [No score_mod, flash='enabled'] Found flash_attncute: True [No score_mod, flash='disabled'] Found flash_attncute: False ✓ Outputs match between flash enabled/disabled ✓ 
Output matches eager SDPA (rtol=0.001, atol=0.001) [Test 2: causal] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 17879 / 134217728 (0.0%) Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed) Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed) [Test 3: rel_bias] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 12836 / 134217728 (0.0%) Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed) Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed) [Test 4: rel_bias_v2] ``` This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive but it is fairly covering Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117 Approved by: https://github.com/mlazos	2025-08-27 18:39:09 +00:00
Sandeep Narendranath Karjala	ec585ceab4	[inductor] structured-log graph execution order + test (#160448 ) Summary: - Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse. - Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}. Testing: - Add inline test to verify structure and output Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448 Approved by: https://github.com/xmfan	2025-08-27 18:12:46 +00:00
Yidi Wu	16ce6a4aad	[hop] move insert_deferred_runtime_asserts under subtracer (#161416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161416 Approved by: https://github.com/pianpwk ghstack dependencies: #160548	2025-08-27 17:43:02 +00:00
Yang Wang	3345a7ff8a	[VLLM][FLASHINFER UPDATE] (#161537 ) VLLM build x torch fails due to flashinfer build fail, detected that vllm team recently changed the point to flashinfer Pull Request resolved: https://github.com/pytorch/pytorch/pull/161537 Approved by: https://github.com/huydhn	2025-08-27 17:41:26 +00:00
Huy Do	55e6ea105c	Fix running the benchmark jobs twice (#161619 ) I made a mistake in https://github.com/pytorch/pytorch/pull/160935 removing this condition check. This ran the benchmark job twice for scheduled jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/17266546494. This was missed during testing because `pull_request` and `workflow_dispatch` were working ok. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161619 Approved by: https://github.com/anijain2305	2025-08-27 17:18:10 +00:00
lakshayg	a3fa1b8c2a	Set USE_NVSHMEM only if USE_DISTRIBUTED is set (#161451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161451 Approved by: https://github.com/eqy	2025-08-27 17:11:19 +00:00
Chris Leonard	620d52e882	Fix sort doc error (#161539 ) Fixes #129298. Updated torch.sort documentation so that the 'stable' parameter is a Keyword Argument. This is how it's implemented in PyTorch. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/161539 Approved by: https://github.com/soulitzer	2025-08-27 17:01:53 +00:00
PyTorch MergeBot	69c7b16e6f	Revert "Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 )" (#161002 )" This reverts commit a03cc53e6f6e2fe67316cb8c74c25f5b953f445b. Reverted https://github.com/pytorch/pytorch/pull/161002 on behalf of https://github.com/guangyey due to This PR breaks CI TestCudaMallocAsync::test_allocator_settings ([comment](https://github.com/pytorch/pytorch/pull/161002#issuecomment-3228980897))	2025-08-27 16:52:22 +00:00
Guilherme Leobas	379ebdaf5e	[OrderedDict] Implement `OrderedDict.popitem(last=...)` (#155153 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155153 Approved by: https://github.com/anijain2305 ghstack dependencies: #160156, #155072, #155152	2025-08-27 15:46:40 +00:00
Guilherme Leobas	7c8f049d54	[OrderedDict] Implement `OrderedDict.move_to_end(key, last=False)` (#155152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155152 Approved by: https://github.com/anijain2305 ghstack dependencies: #160156, #155072	2025-08-27 15:46:40 +00:00
Guilherme Leobas	e3718c4855	[dict] Implement dict.__ior__ and fix return type in dict.__or__ (#155072 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155072 Approved by: https://github.com/anijain2305 ghstack dependencies: #160156	2025-08-27 15:46:40 +00:00
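For reference, the three entries above (#155153, #155152, #155072) concern the following standard-library behaviours. The snippet below is plain CPython, shown only as a reminder of the semantics being implemented; it does not use `torch.compile`, and the variable names are illustrative.

```python
from collections import OrderedDict

od = OrderedDict(a=1, b=2, c=3)

# move_to_end: default moves the key to the back, last=False to the front.
od.move_to_end("a")
od.move_to_end("c", last=False)
order_after_moves = list(od)

# popitem: last=False pops from the front (FIFO), default pops from the back.
front_item = od.popitem(last=False)
back_item = od.popitem()

# dict.__or__ returns a new mapping; dict.__ior__ updates in place.
base = {"x": 1}
merged = base | {"y": 2}
base |= {"x": 10}

# The union of an OrderedDict with a dict keeps the OrderedDict type.
union_type = type(OrderedDict(a=1) | {"b": 2})

print(order_after_moves, front_item, back_item, merged, base, union_type)
```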
Guilherme Leobas	2d44969bbd	Wrap class definitions in `set_fullgraph(False)` in `test_dict`/`test_ordered_dict` (#160156 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160156 Approved by: https://github.com/zou3519	2025-08-27 15:46:40 +00:00
Irem Yuksel	a2af6a9d6b	Run WoArm64 CI every 4 hours (#161504 ) Since WoArm64 isn’t part of CI yet, this PR schedules the workflow to increase visibility and insights. It will execute every 4 hours and still support manual runs via the `ciflow/win-arm64` tag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161504 Approved by: https://github.com/seemethere, https://github.com/atalman	2025-08-27 15:46:34 +00:00
PyTorch MergeBot	28af843ee0	Revert "Fix index_add for int64 input + zerodim index (#161511 )" This reverts commit d51486616cb3fe54bc298669a88059be56c1fb22. Reverted https://github.com/pytorch/pytorch/pull/161511 on behalf of https://github.com/clee2000 due to broke test_indexing.py::TestIndexingCPU::test_index_add_zerodim_index_floating_alpha_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/17257089116/job/48971728595) [HUD commit link](`d51486616c`) on dynamo? ([comment](https://github.com/pytorch/pytorch/pull/161511#issuecomment-3228705842))	2025-08-27 15:38:11 +00:00
Karthick Panner Selvam	378edb047f	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. The implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes included: - Implemented the device_assert op and overrode has_side_effect to return True so that dead code elimination does not remove it. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added a lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op, which supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-27 14:49:20 +00:00
FFFrog	d2db6c86b0	[OpenReg] Add Develop Notes for Integrating New Backend into PyTorch (#158644 ) To facilitate the integration of a new backend, we plan to publish a new development note that details all the key components, hoping to speed up the development of other accelerators. This PR is the beginning of that note and covers operator registration; we will gradually improve it and keep it in sync with OpenReg's code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644 Approved by: https://github.com/albanD	2025-08-27 14:47:25 +00:00
Animesh Jain	a3c1cbdbc6	[dynamo][higher order ops] Refactor for out spec (#161354 ) Preparing for the next PR to add more info in the output spec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161354 Approved by: https://github.com/zou3519	2025-08-27 14:41:18 +00:00
Ting Lu	9632f4ea9f	[CD] [aarch64] Add CUDA 13.0 sbsa nightly build (#161257 ) https://github.com/pytorch/pytorch/issues/159779 CUDA SBSA build for CUDA 13.0 1. Supported archs: sm_80 to sm_120, including support for Thor (sm_110), SPARK (sm_121), GB300 (sm_103). "This release adds support of SM110 GPUs for arm64-sbsa on Linux." from the 13.0 release notes https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html 2. Use -compress-mode=size for binary size reduction: the 13.0 wheel is 2.18 GB versus 3.28 GB for 12.9, a saving of 1.1 GB (~33.5% smaller). 3. Refactored the libs_to_copy list into common libs and version_specific_libs. TODO: add the other CUDA archs in the existing support matrix of x86 to the SBSA build as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161257 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-08-27 14:38:07 +00:00
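The size figures in item 2 of the commit above can be checked with a few lines of arithmetic. The two wheel sizes come from the commit message; the variable names are illustrative only.

```python
# Wheel sizes quoted in the commit message, in GB.
size_cuda_12_9_gb = 3.28
size_cuda_13_0_gb = 2.18

# Absolute saving and the reduction relative to the 12.9 wheel.
saved_gb = size_cuda_12_9_gb - size_cuda_13_0_gb
reduction_pct = saved_gb / size_cuda_12_9_gb * 100

print(f"saved {saved_gb:.2f} GB ({reduction_pct:.1f}% smaller)")
```

The percentage is taken relative to the larger (12.9) wheel, which is what "smaller" implies; dividing by the 13.0 size instead would give a different figure.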
Animesh Jain	3d406429b0	[dynamo][vllm] Support typing.get_type_hints (#161362 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161362 Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel	2025-08-27 09:55:31 +00:00
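The commit above has no description. For context, this is what the standard-library `typing.get_type_hints` call returns in plain Python; how Dynamo traces it, and how vLLM uses it, are not stated in the log. `scale` is a hypothetical function used only for illustration.

```python
from typing import Optional, get_type_hints


def scale(x: "float", factor: Optional[int] = None) -> "float":
    """Toy function with string (forward-reference) annotations."""
    return x * (factor or 1)


# get_type_hints evaluates string annotations into real type objects
# and includes the return annotation under the "return" key.
hints = get_type_hints(scale)
```

Unlike reading `scale.__annotations__` directly, which would yield the raw strings `"float"`, the call resolves them to the `float` type.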
Shangdi Yu	9a12bab0d3	Add debug handle to inductor provenance tracking (#161110 ) Summary: Use debug handle on kernel names to distinguish different calls to the same kernel. Previous kernel name: kernel_name New kernel name: kernel_name:debug_handle We add the debug handle to the tlparse artifacts: `inductor_provenance_tracking_node_mappings` and `inductor_provenance_tracking_kernel_stack_traces`. We also add debug handles in the comments of the generated code so we can map to them in the provenance tracking highlighter tool: https://github.com/pytorch/tlparse/pull/134 Example output code is below. If a kernel doesn't have a debug handle, the `[Provenance debug handles]` comment line will not be written. ``` # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.addmm, aten.gelu] # [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3 stream0 = get_raw_stream(0) triton_poi_fused_addmm_gelu_2.run(buf4, primals_5, 300, stream=stream0) ``` The debug handles will also be used by downstream profilers such as zoomer. Test Plan: ``` buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing ``` Rollback Plan: Differential Revision: D78994959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161110 Approved by: https://github.com/angelayi	2025-08-27 04:56:11 +00:00
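The `kernel_name:debug_handle` naming scheme described in the commit above can be consumed with a small parser. This is a minimal sketch, not code from the PR: `split_kernel_name` is a hypothetical helper, and treating the handle as a non-negative integer is an assumption based on the `triton_poi_fused_addmm_gelu_2:3` example.

```python
from typing import Optional, Tuple


def split_kernel_name(name: str) -> Tuple[str, Optional[int]]:
    """Split 'kernel_name:debug_handle' into its two parts.

    Returns (name, None) when no numeric handle suffix is present,
    mirroring the case where a kernel has no debug handle.
    """
    base, sep, handle = name.rpartition(":")
    if sep and handle.isdigit():
        return base, int(handle)
    return name, None


with_handle = split_kernel_name("triton_poi_fused_addmm_gelu_2:3")
without_handle = split_kernel_name("triton_poi_fused_addmm_gelu_2")
```

Splitting on the last colon keeps the parser tolerant of kernel names that might themselves contain a colon.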
Manuel Candales	d51486616c	Fix index_add for int64 input + zerodim index (#161511 ) Fixes #161446 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161511 Approved by: https://github.com/malfet	2025-08-27 04:11:10 +00:00
Animesh Jain	07a4e9fea8	[benchmarks] Skip mobilenetv3_large_100 in CI for accuracy (#161570 ) To keep the CI green - https://github.com/pytorch/pytorch/issues/161419 It's unclear whether this is a real failure, and debugging it is non-trivial. Skipping for now to keep the CI green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161570 Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519	2025-08-27 03:44:04 +00:00
Michael Lazos	be55d7ac9e	Revert "[Dynamo] Allow inlining into AO quantization modules (#152934 )" (#161567 ) This reverts commit 20e2ca3e29ce9eb33eef17db077696222c175764. Fixes https://github.com/pytorch/pytorch/issues/157434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161567 Approved by: https://github.com/Lucaskabela	2025-08-27 03:33:04 +00:00
William Wen	8b78ba07b1	[dynamo, nested graph breaks] add nested graph break tests (#144516 ) Note: nested graph break tests (and wrapped tests) are xfailed/skipped for now - we will iteratively enable the tests as more of the nested graph break implementation is complete. Differential Revision: [D81084809](https://our.internmc.facebook.com/intern/diff/D81084809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516 Approved by: https://github.com/anijain2305	2025-08-27 03:00:56 +00:00
drisspg	b36a20d368	Ensure large tensor int32 -> int64 indexing is enabled (#157767 ) Fixes: https://github.com/pytorch/pytorch/issues/157446 I think that this delta is worth the switch from block-ptrs, especially since they are deprecated ## Perf Summary A is nightly, B is this diff, so `negative` means this diff improves perf TOP 5 differences <img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" /> <details> <summary><strong>Full perf table (click to expand)</strong></summary> \| attn_type \| dtype \| shape(B,Hq,M,Hkv,N,D) \| TFlops Version A \| TFlops Version B \| \| --- \| --- \| --- \| --- \| --- \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 258.38834144791923 \| 258.6353685004612 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.2192450677751 \| 140.12393320464972 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 122.32683823617003 \| 118.51603755647925 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.48556906165314 \| 137.24259849208627 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 86.59814488695922 \| 84.59431398586257 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 288.52679758135764 \| 292.9174195871856 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 172.25541683643277 \| 172.94326459828508 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 164.40864610599826 \| 165.035129576335 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 176.54876886433945 \| 175.08057670028145 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 125.22491679812626 \| 121.06201152859151 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 339.11952481874283 \| 339.0132835601695 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 227.58583240284406 \| 
228.21824999409597 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 185.98569659868966 \| 182.32850843255093 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 188.9495725191772 \| 180.31385312481657 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 106.25789530994302 \| 106.55084959448476 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 357.6430536888533 \| 363.30843452247274 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 262.3241154406613 \| 265.73250045488 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 249.30498953911416 \| 249.35928192833785 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 224.74126243851808 \| 223.71776504077988 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 168.26977014013707 \| 165.47991483333809 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 382.8178701785897 \| 384.34752965862685 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 308.1449710013853 \| 311.0653716044644 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 251.96365252505072 \| 243.92283557225903 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 226.69316232745368 \| 215.22769268913356 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 153.34142545296405 \| 151.9312673939401 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 396.0998000753126 \| 398.35036286102473 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 333.5198415274966 \| 344.6354466169716 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 310.5955933379696 \| 305.66347819546 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 260.4012412689896 \| 259.758666997307 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 234.13034252182635 \| 227.61676497283614 \| \| noop \| torch.bfloat16 \| 
(2, 16, 8192, 16, 8192, 64) \| 396.17615538477196 \| 401.1419104525502 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 359.98648311998414 \| 360.8285563463094 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 291.97720707257736 \| 281.41694809965253 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 250.1703628419691 \| 238.556760291579 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 199.50782826294306 \| 191.52327358439223 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 411.0632004785396 \| 413.6362648405517 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 382.9404387613185 \| 397.74886235657607 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 357.0998545146633 \| 350.5115200772392 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 281.8033924428203 \| 281.98601309215843 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 282.56595134222135 \| 277.4565795466672 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 408.89838018149516 \| 405.14531386840076 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 396.07662058160264 \| 393.4598228299578 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 317.8822887267849 \| 304.754931401036 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 265.8801304948243 \| 254.22961974295112 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 227.87390579965614 \| 222.19481980110393 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 427.36821778477025 \| 431.3766620314935 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 410.67994346825 \| 423.4666944003808 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 381.1968748374038 \| 381.77668006420424 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 
292.5540046358546 \| 296.5439130720502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 321.04573768858114 \| 310.7423616656888 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 427.46148866769903 \| 426.162091037068 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 419.75580537687347 \| 421.88640120274334 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 337.3208051798903 \| 327.4912454675092 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 276.5638854539581 \| 262.988360558083 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 250.82791326036886 \| 245.07367032501736 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 435.8055824506086 \| 441.8803729460534 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 432.02638235921006 \| 450.33161016596273 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 402.25525939224883 \| 393.8564689669916 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 297.5337286675904 \| 297.0131881135074 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 343.8697037899545 \| 329.8194073407783 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 267.58912366821056 \| 256.91606054118375 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 150.81723692609629 \| 146.32172267858743 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 129.51029293209245 \| 122.72144394093334 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 147.627656359087 \| 141.68956350566188 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 87.55100546003591 \| 84.91293287692788 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 299.5931492743986 \| 305.884253766691 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 179.39026367843837 \| 181.64741311605096 \| \| 
alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 173.93547669282367 \| 173.23972950980564 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 185.90234171599252 \| 182.80844545446686 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 128.08176696266082 \| 123.27722685662111 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 340.50674552770664 \| 338.9071088484576 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 225.4438318650432 \| 230.22899884832975 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 194.15123248528312 \| 185.02793973094865 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 200.74289714108176 \| 191.76606719670647 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 107.03564946728423 \| 106.82432377861258 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 371.31799283918406 \| 379.7555394732925 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 275.97762744310455 \| 276.71106853992995 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 261.6648679783462 \| 259.4127232060398 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 237.03108223577615 \| 233.92710216149527 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 172.13926800371152 \| 168.74390922407585 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 381.50199487767276 \| 383.9043681999597 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 307.9748883093411 \| 312.2403515462001 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 251.11319684705438 \| 243.17870127827277 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 236.3253127246763 \| 223.81250201769552 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 154.55693991756874 \| 153.11360584987685 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 
407.11400078586615 \| 413.53709886086557 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 348.1705797722622 \| 360.09771155957367 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 321.8593280850388 \| 318.2882327401255 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 270.089032013835 \| 268.767323026064 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 238.07324557907788 \| 228.09842078362692 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 399.8172853171901 \| 401.0954526332136 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 363.4387330438581 \| 364.13111024232677 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 294.1752429133857 \| 283.7235663368415 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 256.8389394007649 \| 246.91771015606483 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 199.3378564292656 \| 192.40439590901758 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 425.5150965556111 \| 430.8190098707553 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 396.00437184073013 \| 411.3873625655787 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 369.92803661607815 \| 361.43244467343663 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 293.4277354412933 \| 295.2529537595746 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 288.0208673072841 \| 281.51896404878863 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 408.3005367220567 \| 408.96116482298913 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 396.90095962766304 \| 396.87385456176486 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 319.0534576137999 \| 302.50950358107764 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 270.3334977708081 \| 258.8506349486557 \| \| document_mask \| 
torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 227.46824134365394 \| 222.23759438128766 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 438.24247309479694 \| 437.7975163205371 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 428.34012029699227 \| 433.3215899950434 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 386.52672049728875 \| 388.26216893354984 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 302.71976814728083 \| 302.3574867306459 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 327.39760662780986 \| 308.6348428844912 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 423.31308678262695 \| 426.6306972137279 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 412.6983690923106 \| 419.4961977664297 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 337.41003544742273 \| 324.2155049126126 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 278.7755890910794 \| 265.9194286636502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 251.55678254755364 \| 244.8843180141462 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 452.5930781172308 \| 457.7117122300742 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 445.05676260348116 \| 463.9304535499636 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 415.78302138389415 \| 406.29229555271456 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 308.0311067300895 \| 304.91354721414314 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 351.43943626809335 \| 329.4476923070317 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 295.1801525813241 \| 291.36521287398904 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 183.23250549178067 \| 182.35421238887605 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 
151.56832453117747 \| 151.3422139154794 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 171.02111935180432 \| 160.72516856727913 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 74.05765122783826 \| 74.5885345035243 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 314.3587394591763 \| 319.2938677773619 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 224.57002084153177 \| 225.48868542008177 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.00964804143052 \| 215.39576159953486 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.1174237618258 \| 214.28437413525663 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 121.08920423648368 \| 119.55813661872644 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 362.2193857281911 \| 360.05005804275936 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 279.8840217430121 \| 279.5437918286659 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 227.76617121021982 \| 222.8655938229316 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 215.43141176970562 \| 207.71852284994702 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 121.35588364218539 \| 121.20636565046884 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 365.1545280898012 \| 373.37585444987326 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 304.360119952975 \| 309.1247297936263 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 287.2603904544586 \| 289.25547903162595 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 257.9852675272418 \| 257.59069234098115 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 188.35158496670232 \| 184.24683960154857 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 389.9744911369211 \| 388.43466897254166 \| \| 
causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 345.9228295166513 \| 342.63034895210126 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 279.56334658247437 \| 271.2724375402088 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 245.66477202810066 \| 233.49688207371258 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 170.3270720653187 \| 166.23863845657382 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 400.0041140827554 \| 402.11182445396497 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 363.64641830327434 \| 375.9288663364792 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 341.5776139573363 \| 335.1160003213424 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 281.1811770268521 \| 280.21438270014005 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 247.78716118997716 \| 245.3269825179633 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 403.794126680488 \| 405.2353919019577 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 387.079178426863 \| 385.1461762057035 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 309.7847188173431 \| 298.0443968374749 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 262.4721750159666 \| 250.81679725428586 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 205.70866004479979 \| 202.9620839129557 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 413.380982988662 \| 418.40270594263103 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 398.450064800682 \| 409.6794973994029 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 372.26297458194466 \| 364.44415106552196 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 293.0818569905912 \| 292.85172400643984 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) 
\| 296.46717085592087 \| 285.76362010612763 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 419.3186786037592 \| 426.08801580934437 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 408.1648467766632 \| 409.4122254207817 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 329.24396020457345 \| 313.5200995121138 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 274.61257504571876 \| 255.7801815432177 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 232.63806001220684 \| 230.03020843492314 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 435.0785891054788 \| 440.39101804225345 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 424.86925312752817 \| 435.18898057396825 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 393.000417896268 \| 395.11543361225256 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 297.7755459218185 \| 300.7208114715287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 331.71570861760534 \| 318.07127352552885 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 424.58602747137405 \| 425.84897078470715 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 422.66607285025725 \| 423.5524945535485 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 344.8625760048626 \| 331.6793888458635 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 282.0787281511649 \| 263.7895634445868 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 252.7301927385177 \| 245.41844170037427 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 437.0658069164588 \| 442.9101960063628 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 433.13788271434646 \| 452.3873572709863 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 404.0959191546953 \| 
396.7077863894884 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 300.45502211883206 \| 301.3439134717943 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 344.11003202413934 \| 330.8897663350314 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 298.4364205341705 \| 291.6793556507056 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 187.6382133139633 \| 191.05409897308772 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 156.55822078636112 \| 154.178925976516 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 173.47765221825162 \| 169.30862508068464 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 74.5885345035243 \| 74.52689061607104 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 323.12233826013045 \| 328.53889207933514 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 236.75872140126316 \| 235.8378325547398 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 227.17836523816675 \| 226.75357076139966 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 224.07209453308036 \| 224.07209453308036 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 122.85572156047981 \| 121.11642183704716 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 361.3123326658092 \| 360.71014086458337 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 281.5287983927017 \| 281.94301754758345 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 232.7456696285686 \| 226.50976826432776 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 221.5612361744038 \| 214.96188822837055 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 121.38311528944315 \| 120.85441868178513 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 380.2579019244734 \| 389.2520157863988 \| \| causal \| torch.bfloat16 \| (4, 16, 
2048, 4, 2048, 128) \| 316.95230660496924 \| 317.87597790618906 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 301.07968126657323 \| 298.02424098422983 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 267.2240756921594 \| 267.16353549228154 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 189.82761622494257 \| 186.736450261963 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 389.88665375406805 \| 387.9125133037077 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 348.70619958684887 \| 346.6750499749774 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 280.5472989906087 \| 271.22300822012187 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 250.02397620165968 \| 241.22532776331445 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 171.67817496107645 \| 166.95679280483972 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 412.626880230807 \| 417.60238657950777 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 374.8829313933945 \| 389.4448546468815 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 353.20410434172436 \| 345.7072490717473 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 292.51045924209586 \| 291.66621022138287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 251.6264062063495 \| 248.45110052911542 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 404.0155784550126 \| 401.90546837237514 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 384.4389015599863 \| 386.9684324594344 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 313.3731284132225 \| 298.17074251037894 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 264.19199737284265 \| 252.8982463999916 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 207.03696315185684 \| 202.86697323136772 \| 
\| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 428.2436763312506 \| 433.45005568619536 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 411.8516531869893 \| 428.2753623461049 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 384.9095037182509 \| 372.90888743000744 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 303.2438915629836 \| 302.05095952914337 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 301.8689122735564 \| 285.0363190513223 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 423.13592231504805 \| 420.3991500185611 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 407.44527331585493 \| 408.5064370765247 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 330.50050996167414 \| 316.8763979925965 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 274.6833786307413 \| 259.86098862141324 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 232.24019584158367 \| 226.52040268160232 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 444.4596314237808 \| 455.99558915752266 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 437.4245561244369 \| 455.98275147271966 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 397.3350686877605 \| 397.88875599028063 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 308.53809114394545 \| 307.1359822042007 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 331.32379843423774 \| 316.85293191675646 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 422.4622274366379 \| 425.0407156418684 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 420.9547052783101 \| 430.33779243510276 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 345.50265346504085 \| 332.094855328957 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 
32768, 64) \| 280.81715528243365 \| 264.6543640282054 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 252.25635200421783 \| 245.46235499490305 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 452.5524207341139 \| 461.7512032176736 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 445.2316469907137 \| 464.4523799578466 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 416.87264016717023 \| 409.17124592157046 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 309.42579489389846 \| 307.9734464665731 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 350.50782004300623 \| 330.98959545427294 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767 Approved by: https://github.com/Skylion007	2025-08-27 02:45:20 +00:00
PyTorch MergeBot	de58505890	Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 )" This reverts commit cddcaa19035d6414a351be7c7b16c47d5a0c3466. Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))	2025-08-27 02:36:42 +00:00
atalman	6913529ff8	Move non inductor workflows to Python 3.9 -> 3.10 (#161182 ) Related to: https://github.com/pytorch/pytorch/issues/161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182 Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/seemethere	2025-08-27 02:32:24 +00:00
Gabriel Ferns	4b4cdcfe3a	Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387 ) - Fix Conv exhaustive. - Fix AMD config pruning. - Expand exhaustive test suite. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387 Approved by: https://github.com/coconutruben	2025-08-27 01:54:50 +00:00
Ke Wen	68d395d61e	[3/N][SymmMem] Expose offset field from handle (#161532 ) As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532 Approved by: https://github.com/ngimel ghstack dependencies: #161470, #161471	2025-08-27 00:49:06 +00:00
Ke Wen	4ed71d5412	[2/N][SymmMem] Add MemPool allocator and tests (#161471 ) (Porting most of #161008) Hook the SymmetricMemory allocator into MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`, and so that our ops can have functional variants that create `out` tensors on symmetric memory. For end users, this PR supports a Python interface as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471 Approved by: https://github.com/ngimel ghstack dependencies: #161470	2025-08-27 00:49:06 +00:00
Ke Wen	8dd5aa9689	[1/N][SymmMem] Add offset to handle, cache on base address (#161470 ) For kernels that need peer pointers directly, the rendezvous handle should let the user get the offset of a tensor with respect to the base allocation address. Hence the need to add an `offset` field to the SymmMem handle. But we don't want to cache every handle just because the offsets differ, hence the search and cache logic below: (i) At rendezvous, the search key is still `x.storage().data_ptr()`, as it is today, but the search now has two parts: first a plain dictionary lookup, as today; if that fails, a search of `allocations_` to see whether the storage pointer falls in one of the segments. This is possible because all segments are recorded during allocation. (ii) If this segment hasn't been rendezvoused, we rendezvous it and cache it in the `symm_mem_` map with its base address as the key. (iii) We still need to return a handle for the current tensor, with a corresponding offset. This handle will be a shallow copy of the base handle, with the offset adjusted. Some implementation details: (i.1) If we find a matching allocation, we can immediately use the allocation base address to search `symm_mem_` again. (iii.1) To make the handle copy shallow, we move the common information (base pointers, base signal pad, etc.) to a structure referenced by both handles. The structure is called `NVSHMEMPeerAllocInfo`. A copy of the handle just adds one more `intrusive_ptr` to it. The handle copy constructor accepts an `offset` argument. Test: Existing tests should not fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161470 Approved by: https://github.com/ngimel	2025-08-27 00:49:06 +00:00
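The search-and-cache steps (i) to (iii) in the commit above can be sketched in plain Python. This is only an illustration under stated assumptions, not the C++ implementation: `RendezvousCache`, `Handle`, `PeerAllocInfo`, and `record_alloc` are hypothetical names, addresses are plain integers, and the actual rendezvous is reduced to a counter.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerAllocInfo:
    """Hypothetical stand-in for the state shared by all handles of one allocation."""
    base: int
    size: int


@dataclass(frozen=True)
class Handle:
    info: PeerAllocInfo  # shared by reference, never copied
    offset: int = 0


class RendezvousCache:
    def __init__(self):
        self._bases = []    # sorted base addresses of recorded allocations
        self._sizes = {}    # base address -> allocation size
        self._handles = {}  # base address -> base handle (offset 0)
        self.rendezvous_count = 0

    def record_alloc(self, base, size):
        bisect.insort(self._bases, base)
        self._sizes[base] = size

    def _find_base(self, ptr):
        # Locate the allocation segment that contains ptr, if any.
        i = bisect.bisect_right(self._bases, ptr) - 1
        if i < 0:
            return None
        base = self._bases[i]
        return base if ptr < base + self._sizes[base] else None

    def rendezvous(self, ptr):
        # (i) Plain dictionary lookup first.
        handle = self._handles.get(ptr)
        if handle is not None:
            return handle
        # (i) Otherwise search the recorded segments.
        base = self._find_base(ptr)
        if base is None:
            raise KeyError(f"pointer {ptr} is not inside a recorded allocation")
        # (ii) Rendezvous the segment once and cache it under its base address.
        base_handle = self._handles.get(base)
        if base_handle is None:
            self.rendezvous_count += 1
            base_handle = Handle(PeerAllocInfo(base, self._sizes[base]))
            self._handles[base] = base_handle
        # (iii) Shallow copy: same shared info, adjusted offset.
        return Handle(base_handle.info, ptr - base)
```

Two tensors carved out of the same allocation then share one cached entry and differ only in `offset`, which is the property the commit is after.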
Angela Yi	8ff9485815	[export] Update unflattening dynamo.disable (#161306 ) Summary: Doing inline disabling causes recompiles with the reason "Cache line invalidated because L['___stack0'] got deallocated" Test Plan: CI Rollback Plan: Differential Revision: D80816956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161306 Approved by: https://github.com/pianpwk	2025-08-27 00:27:16 +00:00
William Wen	b074cbaedd	[dynamo] allow resume functions to have name in both freevars and varnames (#161544 ) fixes https://github.com/pytorch/pytorch/issues/161542 Differential Revision: [D81073109](https://our.internmc.facebook.com/intern/diff/D81073109) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161544 Approved by: https://github.com/StrongerXi, https://github.com/anijain2305	2025-08-27 00:25:16 +00:00
Scott Wolchok	80bf883d21	Replace manual cache in _python_dispatch.get_alias_info with functools.cache (#161286 ) In addition to being more code, the manual cache was doing an extra dictionary lookup on each cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161286 Approved by: https://github.com/wconstab	2025-08-27 00:17:51 +00:00
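The extra lookup mentioned in the commit above is easy to see in a small sketch. The names below (`parse_schema`, `alias_info_manual`, `alias_info_cached`) are hypothetical stand-ins and not the code in `_python_dispatch`; only `functools.cache` itself is the real standard-library decorator.

```python
import functools

CALLS = {"parse": 0}


def parse_schema(schema: str) -> tuple:
    """Hypothetical stand-in for an expensive, pure computation."""
    CALLS["parse"] += 1
    return tuple(part.strip() for part in schema.split(","))


# Manual cache: every hit pays for a membership test and then a second lookup.
_manual_cache: dict = {}


def alias_info_manual(schema: str) -> tuple:
    if schema not in _manual_cache:      # lookup 1
        _manual_cache[schema] = parse_schema(schema)
    return _manual_cache[schema]         # lookup 2


# functools.cache: less code, and the lookup is handled inside the decorator.
@functools.cache
def alias_info_cached(schema: str) -> tuple:
    return parse_schema(schema)
```

Both variants compute each distinct argument once; the decorated version also exposes `cache_info()` and `cache_clear()` without extra code.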
Blaine Burton Rister	9de9d25f8d	[Inductor-FX] Support custom triton kernels (#161474 ) # Feature Add support for custom Triton kernels to the FX backend. This turned out not to require any new features, except for a minor change to handle `tl.constexpr` arguments which are not part of the autotuning config. # Caveat This may not cover every possible case. For example, we might need more features for autotuning custom Triton code. This PR entirely skips the [custom codegen ](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L1034-L1039) for user-defined grid functions, but there may be edge cases requiring this logic. However, this PR seems to do a reasonable job as many of the grids end up being written into Inductor/Triton metadata and don't require special codegen. As a follow up, I'm planning to test this against all of AOTI's custom Triton kernel tests. # Test plan Added a CI test using a custom Triton kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161474 Approved by: https://github.com/angelayi	2025-08-27 00:15:19 +00:00
Malay Bag	dbc903a94a	[APS IR] Minfor fix - use GetAttrKey in get_keystr to match with flat args path in unflatten (#161453 ) Summary: While passing path info to [_check_input_constraints_for_graph](https://www.internalfb.com/code/fbsource/[6b5b2dc35902a26ce265e3c0ae5189a3faba1d38]/fbcode/caffe2/torch/export/unflatten.py?lines=594), GetAttrKey is used to specify path str. To match with that get_keystr should also use GetAttrKey. Test Plan: Existing tests ``` buck run mode/opt caffe2/test:test_export -- -r unflatten ``` ``` Ran 413 tests in 204.533s OK (skipped=1, expected failures=13) ``` Rollback Plan: Differential Revision: D80984083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161453 Approved by: https://github.com/tugsbayasgalan	2025-08-27 00:05:20 +00:00
PyTorch MergeBot	1b34e04485	Revert "Update pybind11 submodule to 3.0.1 (#160754 )" This reverts commit 660b0b8128181d11165176ea3f979fa899f24db1. Reverted https://github.com/pytorch/pytorch/pull/160754 on behalf of https://github.com/atalman due to please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226078102))	2025-08-26 23:35:22 +00:00
PyTorch MergeBot	1ce423274d	Revert "[cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063 )" This reverts commit 74c4c758afa8c28162f00a456c185552e1159fd3. Reverted https://github.com/pytorch/pytorch/pull/161063 on behalf of https://github.com/atalman due to sorry broke vllm tests please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/161063#issuecomment-3226065212))	2025-08-26 23:31:23 +00:00
PyTorch MergeBot	4e630f0629	Revert "[Inductor] Update Outer Reduction Heuristic (#159093 )" This reverts commit ca9fe0107e165a4a4147325ff6d34235ebde447f. Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/PaulZhang12 due to Addressing internal implications then relanding ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3225942525))	2025-08-26 22:37:56 +00:00
Karthick Panner Selvam	cddcaa1903	[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677 ) This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084). Changes Included - Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination. - Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor. - Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler. - Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code. - Added test cases to verify both "should throw" and "should not throw" scenarios. Fixes #147282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677 Approved by: https://github.com/mlazos	2025-08-26 22:33:23 +00:00
soulitzer	1e4dfeeb06	Add early_stop kwarg to torch.utils.checkpoint (#160781 ) We already have a context manager "set_checkpoint_early_stop". This PR adds a kwarg that toggles the same setting. A kwarg version of the setting is useful in addition to the context manager because it is annoying to apply a context manager when AC is applied via CheckpointWrapper. Similar to the "debug" kwarg and the corresponding "set_checkpoint_debug_enabled" context manager, the context manager defaults to None and overrides the local setting when non-None. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160781 Approved by: https://github.com/tianyu-l	2025-08-26 22:32:35 +00:00
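The precedence rule described in the commit above (the context manager defaults to None and overrides the per-call setting only when non-None) can be sketched without PyTorch. The names `set_early_stop` and `resolve_early_stop` are hypothetical and only mirror the described behavior; they are not the `torch.utils.checkpoint` implementation.

```python
import contextlib

# None means "no global override; defer to the per-call kwarg".
_global_early_stop = None


@contextlib.contextmanager
def set_early_stop(enable):
    """Hypothetical analogue of a global override context manager."""
    global _global_early_stop
    previous = _global_early_stop
    _global_early_stop = enable
    try:
        yield
    finally:
        _global_early_stop = previous  # restore even if the body raises


def resolve_early_stop(local):
    """Return the effective setting for one call."""
    return local if _global_early_stop is None else _global_early_stop
```

With this shape, a wrapper that only forwards kwargs can still set the behavior per call, while code that does control the call site can force one value for everything inside the `with` block.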
angelayi	4d078cfc4e	[fx] Add is_fx_symbolic_tracing flag (#161385 ) Fixes https://github.com/pytorch/pytorch/issues/135276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161385 Approved by: https://github.com/pianpwk	2025-08-26 22:26:27 +00:00
Ti-Tai Wang	da838f65af	[ONNX] Drop draft_export in exporter API (#161454 ) If the ONNX exporter falls back to draft_export on a big model, the export takes forever and can spam the printout, which keeps users from seeing their stack trace with strict=False. We could consider making draft_export a separate API as a debugging tool, or combining it with report=True when the "model is small". Pull Request resolved: https://github.com/pytorch/pytorch/pull/161454 Approved by: https://github.com/justinchuby	2025-08-26 22:13:43 +00:00
gaoyufeng	cde54fe4e9	fix-unpin-memory-tensor-param (#160992 ) Fixes #160983 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160992 Approved by: https://github.com/ngimel	2025-08-26 21:55:25 +00:00
soulitzer	e06d1d6610	[BE] Improve torch.inference_mode docs and error message (#161164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161164 Approved by: https://github.com/sfc-gh-sbekman, https://github.com/janeyx99	2025-08-26 20:58:56 +00:00
Hashem Hashemi	b2db293abc	[ROCm] No-fence global reduce (#161180 ) This change removes need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics+return. This is crucial for perf in architectures with split caches (e.g. MI300), where fences are inherently costly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161180 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 20:43:59 +00:00
PyTorch MergeBot	6686974ddd	Revert "[dynamo, nested graph breaks] add nested graph break tests (#144516 )" This reverts commit 9a756c2d710a0680bac93ab0b42db519ec2dc6cf. Reverted https://github.com/pytorch/pytorch/pull/144516 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144516#issuecomment-3225659358))	2025-08-26 20:40:17 +00:00
eqy	3d82256a86	[FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on `sm110` or `sm120` either (#161236 ) See also #160693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161236 Approved by: https://github.com/Skylion007	2025-08-26 20:40:11 +00:00
PyTorch MergeBot	a4fb65701b	Revert "[dynamo, nested graph breaks] support very simple nested graph breaks (#159329 )" This reverts commit 8dab6d4c414bf997297804008c3da893e69cd51f. Reverted https://github.com/pytorch/pytorch/pull/159329 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/159329#issuecomment-3225617445))	2025-08-26 20:24:10 +00:00
PyTorch MergeBot	6afd766401	Revert "[dynamo, nested graph breaks] support nested graph breaks x context managers (#159678 )" This reverts commit 02fa5bf6d80fa4baa6bb6dd2fa6a16d88852da91. Reverted https://github.com/pytorch/pytorch/pull/159678 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159678#issuecomment-3225597425))	2025-08-26 20:16:36 +00:00
PyTorch MergeBot	a7aa480e55	Revert "[dynamo, nested graph breaks] support nested closures (#159817 )" This reverts commit ef0ef6f93f7ef6d16d71a6997b72185504acd4b6. Reverted https://github.com/pytorch/pytorch/pull/159817 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159817#issuecomment-3225586996))	2025-08-26 20:13:33 +00:00
PyTorch MergeBot	9f6e1b8730	Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864 )" This reverts commit 3caddd4daa5b1a167663c07219e065e86247ad76. Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))	2025-08-26 20:03:59 +00:00
PyTorch MergeBot	caf98fde0d	Revert "[dynamo, nested graph breaks] clean up comments and codegen (#160138 )" This reverts commit ac6316caaa74513cbcf3c7f9269bc23cd74749db. Reverted https://github.com/pytorch/pytorch/pull/160138 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/160138#issuecomment-3225546707))	2025-08-26 20:01:26 +00:00
PyTorch MergeBot	46576f5a16	Revert "[dynamo, nested graph breaks] prevent excessive recompilations (#159786 )" This reverts commit 67d31f6b281d3b15b205756fc7ebc450cdde1dab. Reverted https://github.com/pytorch/pytorch/pull/159786 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159786#issuecomment-3225535752))	2025-08-26 19:54:22 +00:00
Charlie West-Taylor	77bc959fe1	Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 ) Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314 Approved by: https://github.com/eellison	2025-08-26 19:40:37 +00:00
Jeff Daily	262640fd22	[ROCm][CI] restore test_flex_attention tests (#161519 ) Reverts #161450 and targets specific subtests to skip on MI200. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161519 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 19:31:30 +00:00
Zhengxu Chen	74124d1b46	[reland] [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#161514 ) Summary: convert_frame.compile_frame used to take a callback transform function which will capture the frame object it has, but the frame information is not passed directly into compile_frame function. This PR changes the signature of compile_frame so that frame information is directly passed in the function without taking a callback. This makes it easier to build fullgraph capture API on top of compile_frame. Test Plan: CI Rollback Plan: Differential Revision: D81041296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161514 Approved by: https://github.com/tugsbayasgalan	2025-08-26 19:16:05 +00:00
Joshua Su	a03cc53e6f	Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 )" (#161002 ) Summary: Reverting this diff since it caused S551328. Please see D80217492 for details. Test Plan: NA Rollback Plan: Differential Revision: D80553588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161002 Approved by: https://github.com/jingsh, https://github.com/izaitsevfb	2025-08-26 19:04:13 +00:00
Yidi Wu	00efeabc29	[hop] make materialize_as_graph disable pre-existing dispatch modes (#161220 ) For materializing_as_subgraph, we just want to trace a graph. The handling of different modes should register their own logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161220 Approved by: https://github.com/Lucaskabela	2025-08-26 18:52:38 +00:00
Arsh Zahed	d4703fb91c	[dtensor] Add propagate_tensor_meta function that skips cache if _are_we_tracing (#161334 ) Fixes an issue where the log softmax handler checked the tensor metadata cache without checking for tracing or symints. Probably best to merge this after #160798, but not strictly blocking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161334 Approved by: https://github.com/xmfan	2025-08-26 18:46:58 +00:00
Tom Ritchford	cd87f30295	DOC: Clarify documentation for torch.matmul and fix a typo (#161424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161424 Approved by: https://github.com/AlannaBurke	2025-08-26 18:30:57 +00:00
Lucas Kabela	f0e0a6897e	type misc init and tools for dynamo (#161293 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/161293 Approved by: https://github.com/anijain2305	2025-08-26 17:38:49 +00:00
vishalgoyal316	d2bd55d8de	Typo correction in variable name inital_grad of Class TestFullyShardG… (#161501 ) Typo correction in variable name inital_grad of Class TestFullyShardGradientScaler implementation. Fixes #161480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161501 Approved by: https://github.com/soulitzer	2025-08-26 17:16:42 +00:00
Yidi Wu	6598f00c18	[dynamo] auto lift unbacked symbol in tensor's storage_offset (#161199 )
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True


class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)

        def fn():
            return x0.sin()

        return torch.cond(x0.sum() > 0, fn, fn)


m = M()
out = torch.compile(m, fullgraph=True)(
    torch.tensor(0, dtype=torch.int64, device="cuda"),
    torch.randn(3, 3, device="cuda"),
)
print(out)
```
Before the PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked_symint for storage_offset for the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that's not bound to a proxy. This PR tracks the symbols of storage_offset and associates them with a proxy using torch.ops.aten.storage_offset. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199 Approved by: https://github.com/zou3519 ghstack dependencies: #161198	2025-08-26 17:06:54 +00:00
Yidi Wu	ba6ce66698	[dynamo] lift backed symint output of item() (#161198 ) Before the change in this PR, the following code raised an error:
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True


class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)

        def fn():
            return x0.sin()

        return torch.cond(x0.sum() > 0, fn, fn)


m = M()
out = torch.compile(m, fullgraph=True)(
    torch.tensor(0, dtype=torch.int64),
    torch.randn(3, 3),
)
```
The error occurs while speculating fn: we try to lift the symbol of x0.storage_offset() but find that the symbol has no source associated with it. What really happens is that, when the input is an integer scalar tensor residing on CPU, a shortcut creates a normal symint when .item() is called; see https://github.com/pytorch/pytorch/pull/126245. Previously, we tracked only the unbacked symint outputs of an operation, because we believed every backed symint must have a source associated with it and must already have been lifted as an input at the top level. That invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (only inputs and symbols derived from inputs have a source, and the result of .item() does not). In this PR, we also start tracking normal symints with the proxy that created them (in this case, the proxy for .item()). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198 Approved by: https://github.com/zou3519	2025-08-26 17:06:54 +00:00
PaulZhang12	ca9fe0107e	[Inductor] Update Outer Reduction Heuristic (#159093 ) Update outer reduction heuristics for significant speedups. HuggingFace: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" /> Average ~20% speedup on a kernel by kernel basis TorchBench: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" /> Average ~40% speedup on a kernel by kernel basis <img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" /> Differential Revision: [D80835998](https://our.internmc.facebook.com/intern/diff/D80835998) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093 Approved by: https://github.com/jansel	2025-08-26 16:12:07 +00:00
AmdSampsa	f9df4ec2af	SDPA skip logic for ROCm (#160522 ) Skips some test for flex and eff attention if they are not supported by the hardware Pull Request resolved: https://github.com/pytorch/pytorch/pull/160522 Approved by: https://github.com/drisspg, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 15:51:07 +00:00
Catherine Lee	a72803f1e3	[ez][CI] GIve the linux check job a name that isn't linux-job (#161413 ) Reason: The default name is linux-job, which gets put in the linux category on HUD, but this isn't really a linux related job. Renaming it like this will make it go into the "other" category on HUD Other options: Change the grouping code in test-infra Pull Request resolved: https://github.com/pytorch/pytorch/pull/161413 Approved by: https://github.com/huydhn, https://github.com/seemethere	2025-08-26 15:18:35 +00:00
Jeff Daily	10e67f5ec3	forward fix #161102 (#161465 ) PR #161102 caused tf32 to be the default precision for flex attention. This PR forward-fixes the broken logic and restores ROCm MI200 CI flex attention test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161465 Approved by: https://github.com/jeffdaily, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 15:11:54 +00:00
PyTorch MergeBot	818ba434c7	Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767 )" This reverts commit fc69c2bc67672c3b2d0c62c1821895f09288f1c0. Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to internal failure, sorry will revert ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3224341111))	2025-08-26 14:12:06 +00:00
Ting Lu	ae8d319fd4	Update NVSHMEM to 3.3.24 and fix download link (#161321 ) https://github.com/pytorch/pytorch/issues/159779 Update NVSHMEM 3.3.24 for [PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM](https://github.com/pytorch/pytorch/issues/160980) Enabled back sm_75 for NVSHMEM Fixed the NVSHMEM download link for the issue with 3.3.20 download in issue - [[CD] nvshem-3.3.9 wheels for aarch64 is not manylinux2_28 compliant](https://github.com/pytorch/pytorch/issues/160425) Todo: Should also enable back build ARM with NVSHMEM since it is compatible with manylinux2_28 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161321 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-08-26 13:26:18 +00:00
PyTorch MergeBot	e795450a35	Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900 )" This reverts commit 447d34b5f80fb7350f79decd855cb599cab39083. Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to reverting since can't land existing diff internally, will need to reland it ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3224029031))	2025-08-26 12:45:59 +00:00
David Berard	8c506e6310	[easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445 ) This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed. Notably: * the test has output_size kwarg specified * the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur) Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445 Approved by: https://github.com/eellison, https://github.com/v0i0	2025-08-26 12:32:24 +00:00
PyTorch MergeBot	4a1aca11c2	Revert "[inductor] structured-log graph execution order + test (#160448 )" This reverts commit 995397d47a0e27394ee1010f158e181eb304100a. Reverted https://github.com/pytorch/pytorch/pull/160448 on behalf of https://github.com/atalman due to internal failure please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/160448#issuecomment-3223939035))	2025-08-26 12:20:37 +00:00
Chuanhao Zhuge	e9d42b3880	[small][muon] Use addmm for Newton–Schulz orthogonalization (#161379 ) A performance optimization: use `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.

Benchmark

In a QWEN-like 0.5B model training we observed an average `optimizer.step()` latency improvement: matmul ~44.5 ms -> addmm ~27.4 ms, a 1.62× speedup.

matmul <img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />

addmm <img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />

Testing

End-to-end training: We used a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curves show consistency between normal matmul and addmm. <img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />

Unit test:
```python
# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()

lr = 1e-3
wd = 0.1
momentum = 0.95

opt_ref_muon = Muon(
    params=model0.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
)

opt_exp_muon = Muon(
    params=model1.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
    use_addmm=True,
)

out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()

out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()

for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
shows a numeric difference, but this is expected on bf16 precision:
```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```
~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~ Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, and the result is therefore more numerically accurate than the original form. Based on this, we opt to make `addmm` the default and only option. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379 Approved by: https://github.com/janeyx99	2025-08-26 09:17:28 +00:00
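What "matrix multiply + scale + add in one op" means for the iteration in the commit above can be shown with a dependency-free sketch. Two things here are assumptions: the quintic Newton–Schulz step `X <- a*X + (b*A + c*A@A) @ X` with `A = X @ X.T` is the form commonly used for Muon and is not taken from this commit, and the list-based `addmm` below only mirrors the documented semantics of `torch.addmm` (`beta * input + alpha * (mat1 @ mat2)`); it is not the fused kernel.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def addmm(inp, mat1, mat2, *, beta=1.0, alpha=1.0):
    """Reference semantics only: beta * inp + alpha * (mat1 @ mat2)."""
    prod = matmul(mat1, mat2)
    return [[beta * i + alpha * p for i, p in zip(ri, rp)] for ri, rp in zip(inp, prod)]


def ns_step_unfused(x, a, b, c):
    # Assumed quintic step written as separate matmul, scale, and add operations.
    gram = matmul(x, transpose(x))
    gram_sq = matmul(gram, gram)
    poly = [[b * g + c * g2 for g, g2 in zip(rg, rg2)] for rg, rg2 in zip(gram, gram_sq)]
    update = matmul(poly, x)
    return [[a * xi + ui for xi, ui in zip(rx, ru)] for rx, ru in zip(x, update)]


def ns_step_addmm(x, a, b, c):
    # Same step expressed as two addmm calls.
    gram = matmul(x, transpose(x))
    poly = addmm(gram, gram, gram, beta=b, alpha=c)  # b*A + c*(A @ A)
    return addmm(x, poly, x, beta=a, alpha=1.0)      # a*X + (poly @ X)
```

In this pure-Python form the two versions perform the same arithmetic; the speedup and the reduced intermediate rounding reported in the commit come from the fused kernel, which this sketch does not model.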
Tsung-Hsien Lee	8cfc119491	[pytorch] Simplify codes using `std::all_of()` for `_check_tensors_share_device_and_dtype()` (#161411 ) Summary: These two nested loops of checks could be simplified with `std::all_of()` to make it more compact. Test Plan: OSS CI & tests Rollback Plan: Differential Revision: D80946082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161411 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-08-26 08:56:24 +00:00
Tsung-Hsien Lee	e7e270a33a	[pytorch] Merge two nested if statement checks into one (#161387 ) Summary: This reduces the code indentation level by one. Test Plan: OSS CI & tests Rollback Plan: Differential Revision: D80915357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161387 Approved by: https://github.com/janeyx99	2025-08-26 08:45:36 +00:00
Nikhil Patel	6aef9f3a69	[Inductor][Tritonparse] Call `jit_post_compile_hook` within Inductor Triton Kernel compile path (#161443 ) Summary: Since Inductor skips JIT compilation for Triton kernels, we need to manually invoke `knobs.runtime.jit_post_compile_hook` if one exists. Here, we do this to enable Tritonparse to extract launch metadata from Inductor launched kernels. We can control whether or not Inductor will run the hook with a new `TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK=1 ` config variable. Reviewed By: davidberard98 Differential Revision: D80624932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161443 Approved by: https://github.com/FindHao	2025-08-26 06:24:42 +00:00
Xilun Wu	7376111d59	[BE] fix compute_global_tensor_shape test (#161441 ) Fixes #161154 Test `pytest test/distributed/tensor/test_utils.py -s -k test_compute_global_tensor_shape_1D` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161441 Approved by: https://github.com/kwen2501	2025-08-26 03:22:29 +00:00
PyTorch MergeBot	92ab184824	Revert "[Inductor] Prune configs that require more shared memory than the hardware limit (#161040 )" This reverts commit b2e06e0194c3fa8f7578a1b48751cc027394fb67. Reverted https://github.com/pytorch/pytorch/pull/161040 on behalf of https://github.com/jeffdaily due to still failing on rocm, see https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=inductor%2Ftest_triton_heuristics.py%3A%3ATestTritonHeuristics%3A%3Atest_prune_configs_over_shared_memory_limit_do_pruning_True ([comment](https://github.com/pytorch/pytorch/pull/161040#issuecomment-3222430129))	2025-08-26 03:15:32 +00:00
Zesheng Zong	8c442e4fd3	Fix LBFGS warning convert a tensor with requires_grad=True to a scalar (#160389 ) Fixes #160197 ## Test Result ```python In [1]: import warnings ...: warnings.simplefilter('error') ...: import torch ...: print(torch.__version__) ...: a, b = torch.rand((2, 32, 32)) ...: a.requires_grad_() ...: optimizer = torch.optim.LBFGS([a]) ...: loss_fn = lambda x, y: (x-y).pow(2).mean() ...: ...: def closure(): ...: optimizer.zero_grad() ...: loss = loss_fn(a, b) ...: loss.backward() ...: return loss ...: ...: for i in range(100): ...: optimizer.step(closure) ...: print(i, loss_fn(a, b)) ...: 2.9.0a0+gitf33f3f8 0 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 1 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 2 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 3 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 4 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 5 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 6 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 7 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 8 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 9 tensor(5.8066e-11, grad_fn=<MeanBackward0>) 10 tensor(5.8066e-11, grad_fn=<MeanBackward0>) ... ``` ```bash pytest test/test_optim.py -vv ... 
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_NAdam_cuda_float32 PASSED [2.7192s] [ 99%] test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RAdam_cuda_float32 PASSED [2.5370s] [ 99%] test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RMSprop_cuda_float32 PASSED [2.0190s] [ 99%] test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_Rprop_cuda_float32 PASSED [1.8554s] [ 99%] test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SGD_cuda_float32 PASSED [2.0433s] [ 99%] test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SparseAdam_cuda_float32 PASSED [1.1788s] [100%] ================== 1471 passed, 242 skipped in 2440.52s (0:40:40) ============ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160389 Approved by: https://github.com/janeyx99 Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-08-26 03:07:47 +00:00
angelayi	e34b6a0103	Add meta for add.Scalar (#161332 ) Fixes https://github.com/pytorch/pytorch/issues/161076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161332 Approved by: https://github.com/Skylion007	2025-08-26 02:26:51 +00:00
RajeshvShiyal	f795e92802	space added between type and checking for typechecking (#161352 ) space added between type and checking for "typechecking" Fixes #161282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161352 Approved by: https://github.com/malfet	2025-08-26 02:07:33 +00:00
Huy Do	becd6cd744	Increase timeout value when pushing to ghcr.io (#161444 ) Seeing this time out a lot in trunk now: https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047. The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444 Approved by: https://github.com/atalman	2025-08-26 01:51:16 +00:00
FFFrog	ec21cafd85	[OpenReg] Refactor and Optimize the OpenReg for Preparation of Docs (#159640 ) As the title stated. Changes: - Fixed a bug where abs_stub could not be triggered - Refactor registration to prepare for documentation - Add meta, fallback for openreg Pull Request resolved: https://github.com/pytorch/pytorch/pull/159640 Approved by: https://github.com/albanD	2025-08-26 01:44:21 +00:00
PyTorch MergeBot	908b0ccb1f	Revert "Increase timeout value when pushing to ghcr.io (#161444 )" This reverts commit b9e9e92817fd7d1a778f074105603efb07e05004. Reverted https://github.com/pytorch/pytorch/pull/161444 on behalf of https://github.com/huydhn due to Reland this to generate a different hash value for the benchmark Docker image ([comment](https://github.com/pytorch/pytorch/pull/161444#issuecomment-3222257119))	2025-08-26 01:41:59 +00:00
amdfaa	85adf80cf1	Disable inductor/test_flex_attention.py (#161450 ) Currently inductor/test_flex_attention.py is causing rocm pytorch mi250 shard 1 to go over the timeout limit. This PR is for disabling that test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161450 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 01:28:51 +00:00
Benjamin Glass	74c4c758af	[cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063 Approved by: https://github.com/Skylion007 ghstack dependencies: #160754	2025-08-26 01:21:18 +00:00
Benjamin Glass	660b0b8128	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-26 01:21:18 +00:00
Yiming Zhou	089ad1d88b	[1/n][export] Refactor PT2 Archive weight saving and loading (#160394 ) Summary: We split the refactoring into two parts because of forward-compatibility concerns: first, we land the deserialization (loading) part; then, we land the serialization (saving) part. Save weights and constants as individual files in the PT2 archive. Each weight/constant will be saved as raw bytes, unless it is a custom object (TorchBind object) or a non-fake tensor subclass; for these two special cases we still save them using pickle. The metadata of saved tensors, along with the file name, will be saved as `PayloadMeta`. The mapping from FQN to `PayloadMeta` will be saved as `PayloadConfig` under `WEIGHTS_CONFIG_FORMAT` and `CONTANTS_CONFIG_FORMAT`. This changes the serialization on the Python side when calling `torch.export.save()`. For deserialization in Python (`torch.export.load()`), we make it BC-safe by allowing legacy-format weights/constants to be loaded. For deserialization in C++ (`torch/nativert/ModelRunner.cpp`), we make this a BC-breaking change, as the OSS ModelRunner API is currently not being used. The file structure ``` ├── archive_format ├── archive_version ├── byteorder ├── .data │ ├── serialization_id │ └── version ├── data │ ├── sample_inputs │ │ └── model.pt │ ├── constants │ │ ├── tensor_0 │ │ ├── tensor_1 │ │ └── model_constants_config.json │ └── weights │ ├── weight_0 │ ├── weight_1 │ ├── weight_2 │ ├── weight_3 │ └── model_weights_config.json └── models └── model.json ``` Test Plan: CI Rollback Plan: Differential Revision: D80035490 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160394 Approved by: https://github.com/SherlockNoMad	2025-08-26 01:15:42 +00:00
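The FQN-to-metadata mapping described in #160394 can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the class name `PayloadMetaSketch`, its fields, and the JSON layout are hypothetical and are not the actual `PayloadMeta` / `PayloadConfig` schema used by `torch.export`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PayloadMetaSketch:
    """Hypothetical stand-in for the per-tensor metadata record."""
    file_name: str      # file inside the archive, e.g. "weight_0"
    dtype: str          # assumed field: element type as a string
    shape: list         # assumed field: tensor dimensions
    use_pickle: bool    # True for the special cases that still use pickle


def build_config(entries):
    """Serialize a mapping from FQN to metadata as a JSON document."""
    return json.dumps({fqn: asdict(meta) for fqn, meta in entries.items()}, indent=2)


# Illustrative FQNs; the file names follow the archive layout in the commit message.
entries = {
    "linear.weight": PayloadMetaSketch("weight_0", "float32", [4, 4], False),
    "linear.bias": PayloadMetaSketch("weight_1", "float32", [4], False),
}
config_json = build_config(entries)
loaded = json.loads(config_json)
```

Under this sketch, a loader would read the JSON config first and then open each named file as raw bytes, falling back to pickle only where `use_pickle` is set.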
William Wen	67d31f6b28	[dynamo, nested graph breaks] prevent excessive recompilations (#159786 ) Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object. Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break. Followup: we can skip guards on continuation functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817, #160138	2025-08-26 00:58:38 +00:00
William Wen	ac6316caaa	[dynamo, nested graph breaks] clean up comments and codegen (#160138 ) Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures. Also simplify some codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817	2025-08-26 00:58:38 +00:00
William Wen	ef0ef6f93f	[dynamo, nested graph breaks] support nested closures (#159817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281, #144516, #159329, #159678	2025-08-26 00:58:28 +00:00
William Wen	02fa5bf6d8	[dynamo, nested graph breaks] support nested graph breaks x context managers (#159678 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281, #144516, #159329	2025-08-26 00:58:18 +00:00
William Wen	8dab6d4c41	[dynamo, nested graph breaks] support very simple nested graph breaks (#159329 ) e.g. this graph breaks once now: ```python import torch torch._dynamo.config.nested_graph_breaks = True def inner(x): x = x + 1 torch._dynamo.graph_break() return x + 2 @torch.compile(backend="eager") def outer(x): return inner(x) print(outer(torch.ones(3))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281, #144516	2025-08-26 00:58:07 +00:00
William Wen	9a756c2d71	[dynamo, nested graph breaks] add nested graph break tests (#144516 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971, #159281	2025-08-26 00:57:58 +00:00
William Wen	504a6445a4	[dynamo, nested graph breaks] use CALL_FUNCTION_EX when calling resume function (#159281 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159281 Approved by: https://github.com/anijain2305 ghstack dependencies: #157971	2025-08-26 00:57:48 +00:00
William Wen	2df9b437e3	[dynamo, nested graph breaks] implement new resume frame stack/locals/cell layout convention (#157971 ) The comments/conventions are not exactly correct here, as the implementation at this PR is partial. They will be fixed in #160138. No tests added, since there shouldn't be any overall semantic changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157971 Approved by: https://github.com/anijain2305	2025-08-26 00:57:39 +00:00
rzou	4e19c1906a	Get Inductor periodic CI green (#161297 ) I'll file hi-pri issues for the things that need looking into. Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/161297 Approved by: https://github.com/angelayi	2025-08-26 00:49:49 +00:00
Nikhil Patel	332fa5b388	[Inductor][Triton] Fix SCALING_ROWWISE misclassification for scalar scales (#160450 ) Summary: In `tuned_scaled_mm()`, we unsqueeze any scalar scale from [] -> [1, 1]. Later, when we are determining how to set the `SCALING_ROWWISE` kernel attribute, we check whether the scale has 2 dimensions. However, since we previously unsqueezed any scalar scales, this will always evaluate to True. Test Plan: Run the following tests in test/inductor/test_fp8.py: test_tensorwise_scaling_tma_template test_rowwise_scaling_tma_template Rollback Plan: Differential Revision: D80108117 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160450 Approved by: https://github.com/eellison	2025-08-26 00:24:55 +00:00
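The misclassification described in #160450 can be reproduced with plain shape tuples. This is a minimal sketch, not the Inductor code: the helper names are hypothetical, shapes are assumed to be either scalar `()` or 2-D, and the "fixed" check is only one plausible way to tell the cases apart, not necessarily what the PR does.

```python
def unsqueeze_scalar(shape):
    """Mimic the step in the commit message: a scalar scale [] becomes [1, 1]."""
    return (1, 1) if shape == () else shape


def is_rowwise_buggy(scale_shape):
    # After unsqueezing, every scale has 2 dimensions, so this is always True.
    return len(unsqueeze_scalar(scale_shape)) == 2


def is_rowwise_fixed(scale_shape):
    # Assumed alternative check: a scale is row-wise only if it holds more
    # than one element after unsqueezing.
    rows, cols = unsqueeze_scalar(scale_shape)
    return rows * cols > 1


scalar_scale = ()          # tensor-wise scaling
rowwise_scale = (128, 1)   # one scale per row

buggy_results = (is_rowwise_buggy(scalar_scale), is_rowwise_buggy(rowwise_scale))
fixed_results = (is_rowwise_fixed(scalar_scale), is_rowwise_fixed(rowwise_scale))
```

With the dimension-count check, the scalar case is reported as row-wise; with an element-count check, only the `(128, 1)` scale is.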
Huy Do	b9e9e92817	Increase timeout value when pushing to ghcr.io (#161444 ) Seeing this time out a lot in trunk now: https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047. The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444 Approved by: https://github.com/atalman	2025-08-25 23:52:59 +00:00
Tsung-Hsien Lee	e6aa7287f8	[pytorch] Leverage `unordered_map.try_emplace()` to simplify code (#161388 ) Summary: Because [`unordered_map.try_emplace()`](https://en.cppreference.com/w/cpp/container/unordered_map/try_emplace.html) does not invoke the value's constructor if the key already exists, this matches the previous behavior of checking for the key's existence first and then instantiating the value. Test Plan: OSS CI & tests Rollback Plan: Differential Revision: D80916349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161388 Approved by: https://github.com/janeyx99	2025-08-25 23:33:59 +00:00
atalman	94b9569c4a	Forward fix periodic vision build (#161408 ) Trying to forward fix https://github.com/pytorch/pytorch/issues/161358: use SM 80 architecture by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161408 Approved by: https://github.com/zou3519, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-08-25 23:28:22 +00:00
morrison-turnansky	2cf7ac2fb7	Issue 160495 inductor complex float (#160736 ) Avoiding a call to tensor.view(tensor.real.dtype) when tensor.ndim == 0 fixes the issue; a reshape is called instead. Fixes #160495 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160736 Approved by: https://github.com/ngimel	2025-08-25 23:23:13 +00:00
zhxchen17	447d34b5f8	[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900 ) convert_frame.compile_frame used to take a callback transform function that captured the frame object, so the frame information was not passed directly into the compile_frame function. This PR changes the signature of compile_frame so that frame information is passed directly into the function without a callback. This makes it easier to build a fullgraph capture API on top of compile_frame. @exported-using-ghexport Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900 Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305	2025-08-25 23:16:21 +00:00
Wenyuan Chi	b2e06e0194	[Inductor] Prune configs that require more shared memory than the hardware limit (#161040 ) Summary: This diff removes configs that require more shared memory than the hardware limit, which causes the following compilation error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr ``` Rollback Plan: Differential Revision: D80594562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040 Approved by: https://github.com/eellison	2025-08-25 23:09:09 +00:00
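The pruning idea in #161040 can be sketched in a few lines. The `MMConfig` class, the shared-memory estimate (one A tile plus one B tile per pipeline stage, 2-byte elements), and the function names are all assumptions made for illustration and are not the Inductor implementation; the hardware limit is the value quoted in the error message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMConfig:
    """Hypothetical matmul tile configuration."""
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_shared_memory(cfg, dtype_bytes=2):
    # Assumed estimate: each pipeline stage buffers one A tile (M x K)
    # and one B tile (K x N) in shared memory.
    tile_elems = cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n
    return tile_elems * dtype_bytes * cfg.num_stages


def prune_configs(configs, hardware_limit):
    """Keep only configs whose estimated usage fits within the limit."""
    return [c for c in configs if estimated_shared_memory(c) <= hardware_limit]


HARDWARE_LIMIT = 232448  # bytes, as reported in the error message

configs = [
    MMConfig(256, 256, 64, 5),  # estimate: 327680 bytes, over the limit
    MMConfig(128, 128, 64, 3),  # estimate: 98304 bytes, fits
]
kept = prune_configs(configs, HARDWARE_LIMIT)
```

Under this estimate the first config needs 327680 bytes, the same figure as the "Required" value in the error, so it is dropped before compilation rather than failing with an out-of-resource error.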
drisspg	fc69c2bc67	Ensure large tensor int32 -> int64 indexing is enabled (#157767 ) Fixes: #https://github.com/pytorch/pytorch/issues/157446 I think that this delta is worth the switch form block-ptrs especially since they are deprecated ## Perf Summary A is nightly B is this diff, so `negative` means this diff improves perf TOP 5 differences <img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" /> <details> <summary><strong>Full perf table (click to expand)</strong></summary> \| attn_type \| dtype \| shape(B,Hq,M,Hkv,N,D) \| TFlops Version A \| TFlops Version B \| \| --- \| --- \| --- \| --- \| --- \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 258.38834144791923 \| 258.6353685004612 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.2192450677751 \| 140.12393320464972 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 122.32683823617003 \| 118.51603755647925 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 142.48556906165314 \| 137.24259849208627 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 64) \| 86.59814488695922 \| 84.59431398586257 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 288.52679758135764 \| 292.9174195871856 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 172.25541683643277 \| 172.94326459828508 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 164.40864610599826 \| 165.035129576335 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 176.54876886433945 \| 175.08057670028145 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 16, 1024, 128) \| 125.22491679812626 \| 121.06201152859151 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 339.11952481874283 \| 339.0132835601695 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 227.58583240284406 \| 
228.21824999409597 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 185.98569659868966 \| 182.32850843255093 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 188.9495725191772 \| 180.31385312481657 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 64) \| 106.25789530994302 \| 106.55084959448476 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 357.6430536888533 \| 363.30843452247274 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 262.3241154406613 \| 265.73250045488 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 249.30498953911416 \| 249.35928192833785 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 224.74126243851808 \| 223.71776504077988 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 16, 2048, 128) \| 168.26977014013707 \| 165.47991483333809 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 382.8178701785897 \| 384.34752965862685 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 308.1449710013853 \| 311.0653716044644 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 251.96365252505072 \| 243.92283557225903 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 226.69316232745368 \| 215.22769268913356 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 64) \| 153.34142545296405 \| 151.9312673939401 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 396.0998000753126 \| 398.35036286102473 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 333.5198415274966 \| 344.6354466169716 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 310.5955933379696 \| 305.66347819546 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 260.4012412689896 \| 259.758666997307 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 16, 4096, 128) \| 234.13034252182635 \| 227.61676497283614 \| \| noop \| torch.bfloat16 \| 
(2, 16, 8192, 16, 8192, 64) \| 396.17615538477196 \| 401.1419104525502 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 359.98648311998414 \| 360.8285563463094 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 291.97720707257736 \| 281.41694809965253 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 250.1703628419691 \| 238.556760291579 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 64) \| 199.50782826294306 \| 191.52327358439223 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 411.0632004785396 \| 413.6362648405517 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 382.9404387613185 \| 397.74886235657607 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 357.0998545146633 \| 350.5115200772392 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 281.8033924428203 \| 281.98601309215843 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 16, 8192, 128) \| 282.56595134222135 \| 277.4565795466672 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 408.89838018149516 \| 405.14531386840076 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 396.07662058160264 \| 393.4598228299578 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 317.8822887267849 \| 304.754931401036 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 265.8801304948243 \| 254.22961974295112 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 64) \| 227.87390579965614 \| 222.19481980110393 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 427.36821778477025 \| 431.3766620314935 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 410.67994346825 \| 423.4666944003808 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 381.1968748374038 \| 381.77668006420424 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 
292.5540046358546 \| 296.5439130720502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 16, 16384, 128) \| 321.04573768858114 \| 310.7423616656888 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 427.46148866769903 \| 426.162091037068 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 419.75580537687347 \| 421.88640120274334 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 337.3208051798903 \| 327.4912454675092 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 276.5638854539581 \| 262.988360558083 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 64) \| 250.82791326036886 \| 245.07367032501736 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 435.8055824506086 \| 441.8803729460534 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 432.02638235921006 \| 450.33161016596273 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 402.25525939224883 \| 393.8564689669916 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 297.5337286675904 \| 297.0131881135074 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 16, 32768, 128) \| 343.8697037899545 \| 329.8194073407783 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 267.58912366821056 \| 256.91606054118375 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 150.81723692609629 \| 146.32172267858743 \| \| alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 129.51029293209245 \| 122.72144394093334 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 147.627656359087 \| 141.68956350566188 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 64) \| 87.55100546003591 \| 84.91293287692788 \| \| noop \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 299.5931492743986 \| 305.884253766691 \| \| causal \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 179.39026367843837 \| 181.64741311605096 \| \| 
alibi \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 173.93547669282367 \| 173.23972950980564 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 185.90234171599252 \| 182.80844545446686 \| \| document_mask \| torch.bfloat16 \| (2, 16, 1024, 4, 1024, 128) \| 128.08176696266082 \| 123.27722685662111 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 340.50674552770664 \| 338.9071088484576 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 225.4438318650432 \| 230.22899884832975 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 194.15123248528312 \| 185.02793973094865 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 200.74289714108176 \| 191.76606719670647 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 64) \| 107.03564946728423 \| 106.82432377861258 \| \| noop \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 371.31799283918406 \| 379.7555394732925 \| \| causal \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 275.97762744310455 \| 276.71106853992995 \| \| alibi \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 261.6648679783462 \| 259.4127232060398 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 237.03108223577615 \| 233.92710216149527 \| \| document_mask \| torch.bfloat16 \| (2, 16, 2048, 4, 2048, 128) \| 172.13926800371152 \| 168.74390922407585 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 381.50199487767276 \| 383.9043681999597 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 307.9748883093411 \| 312.2403515462001 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 251.11319684705438 \| 243.17870127827277 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 236.3253127246763 \| 223.81250201769552 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 64) \| 154.55693991756874 \| 153.11360584987685 \| \| noop \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 
407.11400078586615 \| 413.53709886086557 \| \| causal \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 348.1705797722622 \| 360.09771155957367 \| \| alibi \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 321.8593280850388 \| 318.2882327401255 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 270.089032013835 \| 268.767323026064 \| \| document_mask \| torch.bfloat16 \| (2, 16, 4096, 4, 4096, 128) \| 238.07324557907788 \| 228.09842078362692 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 399.8172853171901 \| 401.0954526332136 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 363.4387330438581 \| 364.13111024232677 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 294.1752429133857 \| 283.7235663368415 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 256.8389394007649 \| 246.91771015606483 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 64) \| 199.3378564292656 \| 192.40439590901758 \| \| noop \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 425.5150965556111 \| 430.8190098707553 \| \| causal \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 396.00437184073013 \| 411.3873625655787 \| \| alibi \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 369.92803661607815 \| 361.43244467343663 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 293.4277354412933 \| 295.2529537595746 \| \| document_mask \| torch.bfloat16 \| (2, 16, 8192, 4, 8192, 128) \| 288.0208673072841 \| 281.51896404878863 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 408.3005367220567 \| 408.96116482298913 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 396.90095962766304 \| 396.87385456176486 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 319.0534576137999 \| 302.50950358107764 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 270.3334977708081 \| 258.8506349486557 \| \| document_mask \| 
torch.bfloat16 \| (2, 16, 16384, 4, 16384, 64) \| 227.46824134365394 \| 222.23759438128766 \| \| noop \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 438.24247309479694 \| 437.7975163205371 \| \| causal \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 428.34012029699227 \| 433.3215899950434 \| \| alibi \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 386.52672049728875 \| 388.26216893354984 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 302.71976814728083 \| 302.3574867306459 \| \| document_mask \| torch.bfloat16 \| (2, 16, 16384, 4, 16384, 128) \| 327.39760662780986 \| 308.6348428844912 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 423.31308678262695 \| 426.6306972137279 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 412.6983690923106 \| 419.4961977664297 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 337.41003544742273 \| 324.2155049126126 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 278.7755890910794 \| 265.9194286636502 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 64) \| 251.55678254755364 \| 244.8843180141462 \| \| noop \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 452.5930781172308 \| 457.7117122300742 \| \| causal \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 445.05676260348116 \| 463.9304535499636 \| \| alibi \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 415.78302138389415 \| 406.29229555271456 \| \| sliding_window \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 308.0311067300895 \| 304.91354721414314 \| \| document_mask \| torch.bfloat16 \| (2, 16, 32768, 4, 32768, 128) \| 351.43943626809335 \| 329.4476923070317 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 295.1801525813241 \| 291.36521287398904 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 183.23250549178067 \| 182.35421238887605 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 
151.56832453117747 \| 151.3422139154794 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 171.02111935180432 \| 160.72516856727913 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 64) \| 74.05765122783826 \| 74.5885345035243 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 314.3587394591763 \| 319.2938677773619 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 224.57002084153177 \| 225.48868542008177 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.00964804143052 \| 215.39576159953486 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 216.1174237618258 \| 214.28437413525663 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 16, 1024, 128) \| 121.08920423648368 \| 119.55813661872644 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 362.2193857281911 \| 360.05005804275936 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 279.8840217430121 \| 279.5437918286659 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 227.76617121021982 \| 222.8655938229316 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 215.43141176970562 \| 207.71852284994702 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 64) \| 121.35588364218539 \| 121.20636565046884 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 365.1545280898012 \| 373.37585444987326 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 304.360119952975 \| 309.1247297936263 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 287.2603904544586 \| 289.25547903162595 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 257.9852675272418 \| 257.59069234098115 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 16, 2048, 128) \| 188.35158496670232 \| 184.24683960154857 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 389.9744911369211 \| 388.43466897254166 \| \| 
causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 345.9228295166513 \| 342.63034895210126 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 279.56334658247437 \| 271.2724375402088 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 245.66477202810066 \| 233.49688207371258 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 64) \| 170.3270720653187 \| 166.23863845657382 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 400.0041140827554 \| 402.11182445396497 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 363.64641830327434 \| 375.9288663364792 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 341.5776139573363 \| 335.1160003213424 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 281.1811770268521 \| 280.21438270014005 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 16, 4096, 128) \| 247.78716118997716 \| 245.3269825179633 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 403.794126680488 \| 405.2353919019577 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 387.079178426863 \| 385.1461762057035 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 309.7847188173431 \| 298.0443968374749 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 262.4721750159666 \| 250.81679725428586 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 64) \| 205.70866004479979 \| 202.9620839129557 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 413.380982988662 \| 418.40270594263103 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 398.450064800682 \| 409.6794973994029 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 372.26297458194466 \| 364.44415106552196 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) \| 293.0818569905912 \| 292.85172400643984 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 16, 8192, 128) 
\| 296.46717085592087 \| 285.76362010612763 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 419.3186786037592 \| 426.08801580934437 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 408.1648467766632 \| 409.4122254207817 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 329.24396020457345 \| 313.5200995121138 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 274.61257504571876 \| 255.7801815432177 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 64) \| 232.63806001220684 \| 230.03020843492314 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 435.0785891054788 \| 440.39101804225345 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 424.86925312752817 \| 435.18898057396825 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 393.000417896268 \| 395.11543361225256 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 297.7755459218185 \| 300.7208114715287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 16, 16384, 128) \| 331.71570861760534 \| 318.07127352552885 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 424.58602747137405 \| 425.84897078470715 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 422.66607285025725 \| 423.5524945535485 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 344.8625760048626 \| 331.6793888458635 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 282.0787281511649 \| 263.7895634445868 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 64) \| 252.7301927385177 \| 245.41844170037427 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 437.0658069164588 \| 442.9101960063628 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 433.13788271434646 \| 452.3873572709863 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 404.0959191546953 \| 
396.7077863894884 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 300.45502211883206 \| 301.3439134717943 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 16, 32768, 128) \| 344.11003202413934 \| 330.8897663350314 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 298.4364205341705 \| 291.6793556507056 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 187.6382133139633 \| 191.05409897308772 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 156.55822078636112 \| 154.178925976516 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 173.47765221825162 \| 169.30862508068464 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 64) \| 74.5885345035243 \| 74.52689061607104 \| \| noop \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 323.12233826013045 \| 328.53889207933514 \| \| causal \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 236.75872140126316 \| 235.8378325547398 \| \| alibi \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 227.17836523816675 \| 226.75357076139966 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 224.07209453308036 \| 224.07209453308036 \| \| document_mask \| torch.bfloat16 \| (4, 16, 1024, 4, 1024, 128) \| 122.85572156047981 \| 121.11642183704716 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 361.3123326658092 \| 360.71014086458337 \| \| causal \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 281.5287983927017 \| 281.94301754758345 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 232.7456696285686 \| 226.50976826432776 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 221.5612361744038 \| 214.96188822837055 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 64) \| 121.38311528944315 \| 120.85441868178513 \| \| noop \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 380.2579019244734 \| 389.2520157863988 \| \| causal \| torch.bfloat16 \| (4, 16, 
2048, 4, 2048, 128) \| 316.95230660496924 \| 317.87597790618906 \| \| alibi \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 301.07968126657323 \| 298.02424098422983 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 267.2240756921594 \| 267.16353549228154 \| \| document_mask \| torch.bfloat16 \| (4, 16, 2048, 4, 2048, 128) \| 189.82761622494257 \| 186.736450261963 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 389.88665375406805 \| 387.9125133037077 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 348.70619958684887 \| 346.6750499749774 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 280.5472989906087 \| 271.22300822012187 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 250.02397620165968 \| 241.22532776331445 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 64) \| 171.67817496107645 \| 166.95679280483972 \| \| noop \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 412.626880230807 \| 417.60238657950777 \| \| causal \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 374.8829313933945 \| 389.4448546468815 \| \| alibi \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 353.20410434172436 \| 345.7072490717473 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 292.51045924209586 \| 291.66621022138287 \| \| document_mask \| torch.bfloat16 \| (4, 16, 4096, 4, 4096, 128) \| 251.6264062063495 \| 248.45110052911542 \| \| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 404.0155784550126 \| 401.90546837237514 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 384.4389015599863 \| 386.9684324594344 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 313.3731284132225 \| 298.17074251037894 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 264.19199737284265 \| 252.8982463999916 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 64) \| 207.03696315185684 \| 202.86697323136772 \| 
\| noop \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 428.2436763312506 \| 433.45005568619536 \| \| causal \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 411.8516531869893 \| 428.2753623461049 \| \| alibi \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 384.9095037182509 \| 372.90888743000744 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 303.2438915629836 \| 302.05095952914337 \| \| document_mask \| torch.bfloat16 \| (4, 16, 8192, 4, 8192, 128) \| 301.8689122735564 \| 285.0363190513223 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 423.13592231504805 \| 420.3991500185611 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 407.44527331585493 \| 408.5064370765247 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 330.50050996167414 \| 316.8763979925965 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 274.6833786307413 \| 259.86098862141324 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 64) \| 232.24019584158367 \| 226.52040268160232 \| \| noop \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 444.4596314237808 \| 455.99558915752266 \| \| causal \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 437.4245561244369 \| 455.98275147271966 \| \| alibi \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 397.3350686877605 \| 397.88875599028063 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 308.53809114394545 \| 307.1359822042007 \| \| document_mask \| torch.bfloat16 \| (4, 16, 16384, 4, 16384, 128) \| 331.32379843423774 \| 316.85293191675646 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 422.4622274366379 \| 425.0407156418684 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 420.9547052783101 \| 430.33779243510276 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 345.50265346504085 \| 332.094855328957 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 
32768, 64) \| 280.81715528243365 \| 264.6543640282054 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 64) \| 252.25635200421783 \| 245.46235499490305 \| \| noop \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 452.5524207341139 \| 461.7512032176736 \| \| causal \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 445.2316469907137 \| 464.4523799578466 \| \| alibi \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 416.87264016717023 \| 409.17124592157046 \| \| sliding_window \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 309.42579489389846 \| 307.9734464665731 \| \| document_mask \| torch.bfloat16 \| (4, 16, 32768, 4, 32768, 128) \| 350.50782004300623 \| 330.98959545427294 \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767 Approved by: https://github.com/Skylion007	2025-08-25 22:51:00 +00:00
Michael Lazos	adecb0c9e8	[Cutlass-EVT] Fix buffer size issues (#161335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161335 Approved by: https://github.com/henrylhtsang ghstack dependencies: #161398	2025-08-25 22:08:30 +00:00
Michael Lazos	d57c79e609	[Cutlass] Fix regression from f7ad69f (#161398 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161398 Approved by: https://github.com/henrylhtsang	2025-08-25 22:08:30 +00:00
atalman	1a566c4909	Remove Python 3.9 nightly builds (#161427 ) Please see https://github.com/pytorch/pytorch/issues/161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161427 Approved by: https://github.com/huydhn	2025-08-25 22:05:40 +00:00
Michael Lazos	37a34022b5	[Pattern Matcher] improve error msg (#161423 ) Updates pattern matcher error message Pull Request resolved: https://github.com/pytorch/pytorch/pull/161423 Approved by: https://github.com/mengluy0125, https://github.com/masnesral	2025-08-25 21:48:54 +00:00
Huy Do	763053dc53	Always run OIDC auth on B200 to be able to upload artifacts to S3 (#161436 ) Reported by @drisspg , in its current form, the OIDC auth step wasn't run when the previous test step failed. We need this to always run to be able to upload artifacts to S3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161436 Approved by: https://github.com/nWEIdia, https://github.com/drisspg	2025-08-25 21:05:20 +00:00
Daniel Galvez	cf94cadbee	[CUDAGraph] Add getter for cuda graph exec (#161294 ) This is far simpler than #155164 since we never destroy the cudaGraphExec_t. The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear. This can work if your attention kernel can work for arbitrary shapes (this is not the case for all attention implementations! Many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SMs (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work. Note that this requires knowledge of the exact location of the value in your kernel's parameter buffer to mutate. It won't work with arbitrary stream capture code whose kernels you don't know beforehand. So I expect this code path to be rarely used. Testing: ``` pytest -s -k raw_graph_exec test/test_cuda.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294 Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy	2025-08-25 20:57:37 +00:00
Sandeep Narendranath Karjala	995397d47a	[inductor] structured-log graph execution order + test (#160448 ) Summary: - Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse. - Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}. Testing: - Add inline test to verify structure and output Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448 Approved by: https://github.com/xmfan	2025-08-25 20:12:18 +00:00
Chen	ffa1ce7650	Fix the parity of original and exported module parameters (#160600 ) ## Problem This fixes a parameter mismatch during torch.export in strict mode (see the "How to reproduce the issue?" section below). When two attributes map to the same tensor, strict mode will 1. Build a standard param buffer table to standardize names (the bug happens [here](`f861dc1826/torch/export/_trace.py (L356)`): when two parameters have the same id(param), the later name overwrites the earlier one) 2. [Update](`f861dc1826/torch/export/_trace.py (L1481)`) the exported signature with the updated standard FQN (problematic) 3. When exported_program.module() is called, call [_unlift_exported_program_lifted_states](`f861dc1826/torch/export/exported_program.py (L1297)`) to recover attributes from the exported signature, where the parameter name is defined and standardized. As a result, named_parameters() of this module reports the overwritten name instead of the original name. ## How to reproduce the issue? Reproduction shared by @taotaohuang001, torch version: 2.8.0 ```python import torch from torch import nn # ---- Toy model with embedding weight sharing (aliasing) ---- class Toy(nn.Module): def __init__(self): super().__init__() self.embedding_layers = nn.ModuleDict() tbl = nn.Embedding(100, 8) self.embedding_layers["ActorId"] = tbl # Alias: reuse the SAME module instance for another feature self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"] self.proj = nn.Linear(16, 1) def forward(self, feats: dict[str, torch.Tensor]): e1 = self.embedding_layers["ActorId"](feats["ActorId"]) e2 = self.embedding_layers["RootActorId"](feats["RootActorId"]) return self.proj(torch.cat([e1, e2], dim=-1)) torch.manual_seed(0) m = Toy().eval() # Show pre-export parameter names (canonicalized; shared weight appears once) print("PRE-EXPORT named_parameters:") print([name for name, _ in m.named_parameters()]) # Sanity: the two feature names point to the same weight object w1 = 
m.embedding_layers["ActorId"].weight w2 = m.embedding_layers["RootActorId"].weight print("PRE-EXPORT alias -> same object:", w1 is w2, "\| same storage:", w1.data_ptr() == w2.data_ptr()) # Example inputs (dict structure will be captured by export) ex_in = { "ActorId": torch.randint(0, 100, (4,)), "RootActorId": torch.randint(0, 100, (4,)), } # ---- Export (in memory) and materialize the runnable module ---- ep = torch.export.export(m, (ex_in,), strict=True) gm = ep.module() # GraphModule with new (canonical) parameter names print("\nPOST-EXPORT named_parameters (GraphModule):") post_names = [name for name, _ in gm.named_parameters()] print(post_names) # Prove alias persists after export: run fwd/bwd and check a single grad tensor exists out = gm(ex_in).sum() out.backward() # Find the embedding weight in the exported module by shape (100, 8) emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])] print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>") # Show that only one grad exists for the shared table for name, p in gm.named_parameters(): if p.grad is not None and p.shape == torch.Size([100, 8]): print("Grad present on shared embedding weight:", name, "\| grad shape:", tuple(p.grad.shape)) break ``` And you will see parameters are different before and after export ``` PRE-EXPORT named_parameters: ['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias'] PRE-EXPORT alias -> same object: True \| same storage: True POST-EXPORT named_parameters (GraphModule): ['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias'] Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight Grad present on shared embedding weight: embedding_layers.RootActorId.weight \| grad shape: (100, 8) ``` ## Solution Fixing this issue by making sure latter named parameter will not overwrite the `param_buffer_table` when original model's named parameter already maps to certain 
parameter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160600 Approved by: https://github.com/angelayi	2025-08-25 19:40:06 +00:00
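The fix described in #160600 above can be sketched without torch: when several names refer to the same object, a table keyed by `id(obj)` should keep the first name it sees rather than let a later alias overwrite it. The helper name `build_name_table` and its `keep_first` flag are hypothetical and exist only for illustration; the actual change lives in `torch/export/_trace.py` and may look different.

```python
def build_name_table(named_objects, keep_first=True):
    """Map id(obj) -> name for (name, obj) pairs that may contain aliases.

    Hypothetical helper: keep_first=False mimics the overwrite behaviour
    described in the commit, keep_first=True mimics the described fix.
    """
    table = {}
    for name, obj in named_objects:
        if keep_first:
            # Only record a name if this object has not been seen yet.
            table.setdefault(id(obj), name)
        else:
            # Later aliases silently replace earlier names.
            table[id(obj)] = name
    return table


shared = object()  # stands in for the shared embedding weight
pairs = [
    ("embedding_layers.ActorId.weight", shared),
    ("embedding_layers.RootActorId.weight", shared),
]
overwritten = build_name_table(pairs, keep_first=False)
preserved = build_name_table(pairs, keep_first=True)
print(overwritten[id(shared)])
print(preserved[id(shared)])
```

With `keep_first=True` the table reports the name that `named_parameters()` showed before export, which is the parity the commit title refers to.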
PyTorch MergeBot	3e210f90c2	Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900 )" This reverts commit 1113e7de30da95973c1eac7921601f9a0e94f2db. Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to executorch failure ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3221372096))	2025-08-25 18:56:18 +00:00
Scott Wolchok	660b5656a4	Inline is_read_only_alias_match in _correct_storage_aliasing (#161285 ) Drives down the overhead of return_and_correct_storage_aliasing slightly. Hopefully you'll agree it doesn't compromise readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161285 Approved by: https://github.com/wconstab ghstack dependencies: #161231, #161234, #161235, #161240, #161284	2025-08-25 18:35:21 +00:00
Scott Wolchok	0e0bb4f1fd	Remove unnecessary len() call in _correct_storage_aliasing.is_read_only_alias_match (#161284 ) Containers are truthy iff they're non-empty. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161284 Approved by: https://github.com/Skylion007, https://github.com/wconstab ghstack dependencies: #161231, #161234, #161235, #161240	2025-08-25 18:35:21 +00:00
Scott Wolchok	b048f0e189	Improve efficiency of _python_dispatch.return_and_correct_aliasing (#161240 ) get_write_alias() call count reduction explained briefly in code comment. We don't need to check write_aliases against None in the final outs_to_return calculation because we just did that check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161240 Approved by: https://github.com/wconstab ghstack dependencies: #161231, #161234, #161235	2025-08-25 18:35:21 +00:00
Scott Wolchok	c35538d3c5	Minor cleanup of DeviceMesh.__eq__ (#161235 ) `self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161235 Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj ghstack dependencies: #161231, #161234	2025-08-25 18:35:21 +00:00
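A minimal sketch of the identity fast path mentioned in #161235 above. The `Mesh` class is a hypothetical stand-in, not the real `DeviceMesh`; it only shows that `self is other` and `id(self) == id(other)` agree, with the former being a single operator.

```python
class Mesh:
    """Hypothetical stand-in for a mesh-like object with value equality."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def __eq__(self, other):
        if self is other:  # same object: skip the field-by-field comparison
            return True
        if not isinstance(other, Mesh):
            return NotImplemented
        return self.shape == other.shape

    def __hash__(self):
        return hash(self.shape)


a = Mesh((2, 4))
b = Mesh((2, 4))
# Identity and id() comparison always give the same answer.
same_via_is = a is b
same_via_id = id(a) == id(b)
print(same_via_is, same_via_id, a == b)
```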
Scott Wolchok	cfafd98c53	Use comparison key in OpSchema to avoid duplicate work between `__hash__` and `__eq__` (#161234 ) The performance cost of `dict` lookups keyed by `OpSchema` is a significant minority of DTensor overhead. With this change we shave a net ~1% off the total running time of the benchmark from #160580, as measured by using cProfile and comparing cumulative time spent in propagate + OpSchema's `__post_init__`. (`__post_init__` grew from 2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161234 Approved by: https://github.com/wconstab ghstack dependencies: #161231	2025-08-25 18:35:21 +00:00
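The idea in #161234 above (compute one comparison key up front, then reuse it in both `__hash__` and `__eq__`) can be sketched as follows. The `Schema` class and its fields are hypothetical and much simpler than the real `OpSchema`; the sketch only illustrates the trade-off the commit measures, where `__post_init__` does a little more work so that dict lookups do less.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: we supply __eq__ and __hash__ ourselves
class Schema:
    """Hypothetical schema-like key used for dict lookups."""

    op: str
    args: tuple
    _key: tuple = field(init=False, repr=False)

    def __post_init__(self):
        # Built once per instance; both methods below reuse it, so they
        # cannot disagree about which fields take part in comparison.
        self._key = (self.op, self.args)

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        if not isinstance(other, Schema):
            return NotImplemented
        return self._key == other._key


cache = {Schema("aten.add", (1, 2)): "cached strategy"}
hit = cache.get(Schema("aten.add", (1, 2)))
miss = cache.get(Schema("aten.mul", (1, 2)))
print(hit, miss)
```

Deriving both methods from the same key also avoids the kind of mismatch fixed in #161231, where `__hash__` covered fields that `__eq__` ignored.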
Scott Wolchok	5d6434b132	Fix OpSchema equality check (#161231 ) `__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and it looks like attention was paid to hash, so I made comparison follow suit). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161231 Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/zpcore	2025-08-25 18:35:21 +00:00
xinan.lin	2f0de0ff93	[Inductor] Update Intel Triton for PyTorch 2.9. (#161050 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161050 Approved by: https://github.com/anmyachev, https://github.com/EikanWang, https://github.com/jansel	2025-08-25 17:18:19 +00:00
angelayi	c081481bbe	[aoti-fx] Output OpOverload fallbacks (#161195 ) Updates the inductor-wrapper-fxir code to use the kernel.op_overload when generating extern kernel calls. This way we can keep the IR consistent with using ATen ops. TODO: we're also inserting torch.empty_strided calls -- need to turn this into aten too Pull Request resolved: https://github.com/pytorch/pytorch/pull/161195 Approved by: https://github.com/blaine-rister	2025-08-25 17:03:05 +00:00
PyTorch MergeBot	df571ae7ad	Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387 )" This reverts commit 3ea6cc8c2d443d6104159d50e8328c144f6caa39. Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/jeffdaily due to breaks ROCm, AttributeError: 'torch._C._CudaDeviceProperties' object has no attribute 'shared_memory_per_block_optin' ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3220989480))	2025-08-25 16:50:03 +00:00
Animesh Jain	9e1c954134	[dynamo] Pass requires_grad to nn.Parameter construction (#161364 ) Fixes https://github.com/pytorch/pytorch/issues/161191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161364 Approved by: https://github.com/Skylion007, https://github.com/StrongerXi	2025-08-25 16:49:28 +00:00
Tom Ritchford	83283ce7f5	docstring_linter: Fix #151692 and other issues (#156596 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156596 Approved by: https://github.com/eellison	2025-08-25 16:04:14 +00:00
Hashem Hashemi	ab8d60f4c8	[ROCm] Unroll loads in global_reduce (#161181 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161181 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-25 15:45:49 +00:00
Xuehai Pan	af3265d20f	[BE][CI] fix `pkg=<pin>` to `pkg==<pin>` in pip requirement specs (#160811 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160811 Approved by: https://github.com/seemethere	2025-08-25 15:31:21 +00:00
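For context on #160811 above: pip requirement specifiers pin a version with `==`, so a single `=` is not a valid comparison operator. A rough way to spot such pins is sketched below. The helper names are hypothetical and the regex is a heuristic under stated assumptions: it would also flag unrelated `=` signs such as `--index-url=...`, so it suits plain `pkg=<pin>` strings only.

```python
import re

# A lone "=": not preceded by another operator character and not followed by "=".
_SINGLE_EQ = re.compile(r"(?<![=!<>~])=(?!=)")


def has_single_equals(spec: str) -> bool:
    """Return True if the requirement string contains a lone '='."""
    return _SINGLE_EQ.search(spec) is not None


def fix_pin(spec: str) -> str:
    """Rewrite a lone '=' to '==' and leave other operators untouched."""
    return _SINGLE_EQ.sub("==", spec)


for spec in ["numpy=1.26.0", "numpy==1.26.0", "numpy>=1.26.0", "numpy~=1.26"]:
    print(spec, has_single_equals(spec), fix_pin(spec))
```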
Eddie Yan	f391afe9bf	[cuDNN][convolution] remove redundant conv3d 64bit test (#161177 ) turns out it's the same as ``` @onlyCUDA @largeTensorTest("40GB") @largeTensorTest("24GB", "cpu") @tf32_on_and_off(0.005) def test_conv3d_64bit_indexing(self, device): x = torch.rand(1, 32, 512, 512, 256) m = torch.nn.Conv3d(32, 1, kernel_size=1, padding=0, stride=1, bias=False) yref = m(x) y = m.to(device=device)(x.to(device=device)) self.assertEqual(yref, y) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161177 Approved by: https://github.com/Skylion007	2025-08-25 15:01:05 +00:00
zhxchen17	1113e7de30	[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900 ) convert_frame.compile_frame used to take a callback transform function that captured the frame object, so the frame information was not passed directly into compile_frame. This PR changes the signature of compile_frame so that frame information is passed in directly, without a callback. This makes it easier to build a fullgraph capture API on top of compile_frame. @exported-using-ghexport Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900 Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305	2025-08-25 14:53:54 +00:00
PyTorch MergeBot	40c0e700a4	Revert "[AMD] Fix AMD User Defined Kernel Autotune (#160671 )" This reverts commit 431846a6323c6f1d02da49e311ac694324f386f4. Reverted https://github.com/pytorch/pytorch/pull/160671 on behalf of https://github.com/atalman due to new test is failing: inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_rocm_triton_autotuning_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17172795679/job/48725235301) [HUD commit link](`431846a632`) ([comment](https://github.com/pytorch/pytorch/pull/160671#issuecomment-3220442141))	2025-08-25 14:07:48 +00:00
zeshengzong	510825e5fe	Optimize `dynamo` typing (#147499 ) Optimize dynamo methods type annotation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147499 Approved by: https://github.com/anijain2305	2025-08-25 13:20:45 +00:00
PyTorch MergeBot	ab7787fb82	Revert "[inductor] Windows inductor use intel-openmp. (#160258 )" This reverts commit 41673110cd7c5960824cc74a6fcaeda1a8bc7a23. Reverted https://github.com/pytorch/pytorch/pull/160258 on behalf of https://github.com/malfet due to Reverting to fix https://github.com/pytorch/pytorch/issues/160898 and https://github.com/pytorch/pytorch/issues/160962 ([comment](https://github.com/pytorch/pytorch/pull/160258#issuecomment-3220158145))	2025-08-25 12:57:47 +00:00
PyTorch MergeBot	1eccfb157a	Revert "[BE] Remove intel-openmp dependency in setup.py (#160976 )" This reverts commit e4839470470168648dee5997f57347bb8541ea2b. Reverted https://github.com/pytorch/pytorch/pull/160976 on behalf of https://github.com/malfet due to This PR is doing something strange ([comment](https://github.com/pytorch/pytorch/pull/160976#issuecomment-3220120462))	2025-08-25 12:46:12 +00:00
Raman Kumar	4651aaac47	Fix typo: 'complext' (#160335 ) minor fix for a typo: `complext` to `complex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160335 Approved by: https://github.com/Skylion007	2025-08-25 10:37:59 +00:00
Liang Wang	037c43d3b2	[tgif] fix getattr_recursive with ModuleList (#161204 ) Summary: This change updates `getattr_recursive` to handle qualnames with ModuleList that contain digit indices, for example, `op_instances.1.value_model.feature_weights` Test Plan: TBA Rollback Plan: Reviewed By: jiayisuse Differential Revision: D80503985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161204 Approved by: https://github.com/jiayisuse	2025-08-25 10:08:47 +00:00
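The behaviour described in #161204 above, resolving a dotted qualname whose segments may be digit indices, can be sketched with plain Python objects. Here ordinary lists stand in for `ModuleList`, and this `getattr_recursive` is an illustrative reimplementation, not the function changed in the PR.

```python
from types import SimpleNamespace


def getattr_recursive(obj, qualname):
    """Walk a dotted path, treating digit segments as sequence indices."""
    for part in qualname.split("."):
        if part.isdigit() and isinstance(obj, (list, tuple)):
            obj = obj[int(part)]  # e.g. the "1" in "op_instances.1"
        else:
            obj = getattr(obj, part)
    return obj


# Hypothetical object tree mirroring the qualname from the commit message.
root = SimpleNamespace(
    op_instances=[
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w0")),
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w1")),
    ]
)
found = getattr_recursive(root, "op_instances.1.value_model.feature_weights")
print(found)
```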
Dmitry Rogozhkin	eb5549a431	xpu: fix cpp_extension compatibility with oneapi dpc++ 2025.2 compiler (#161012 ) The Intel oneAPI DPC++ compiler has changed (fixed) how it parses the `-fsycl-host-compiler-options` option with respect to arguments containing escaped quotes. This commit adds branches that depend on the compiler version. Fixes: https://github.com/intel/torch-xpu-ops/issues/1938 CC: @chuanqi129 @EikanWang @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/161012 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-25 09:29:53 +00:00
FFFrog	56ebed627a	[OpenReg] Add OSX/Windows Support for OpenReg (#159441 ) As the title states. Changes: - Abstract platform-specific APIs - Add OSX/Windows support - Set default symbol visibility to "hidden" Co-authored-by: @can-gaa-hou Original PR: https://github.com/pytorch/pytorch/pull/159029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159441 Approved by: https://github.com/albanD Co-authored-by: jiahaochen666 <jiahaochen535@gmail.com>	2025-08-25 08:03:27 +00:00
Liao, Wei	80df27a612	port distributed pipeline test files for Intel GPU (#159033 ) This PR ports all distributed pipeline test files. Intel GPU is enabled with the following methods, keeping the original code style as far as possible: 1. instantiate_device_type_tests() 2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 3. use "requires_accelerator_dist_backend()" to replace requires_nccl() 4. use "get_default_backend_for_device()" to get the backend 5. enable XPU for some test paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033 Approved by: https://github.com/guangyey, https://github.com/kwen2501	2025-08-25 05:24:27 +00:00
Will Constable	e3d68dfae2	[DTensor] Make default RNG semantics match user-passed generator (#160482 ) Previously, DTensor kept its own copy of the generator state after the first time a random operator was called on a DTensor. This copy would evolve independently from the generator outside of DTensor. After adding support for users to pass a specific generator into random operators (e.g. `uniform_(..., generator=)`), it was determined (in discussion on #159991) to change the semantics so that any random operations performed on DTensor would evolve the state of the publicly visible generators (either the default one or user-passed one). The upsides are (1) it is now possible to call torch.manual_seed() at any point in the program and have a consistent effect on DTensor, (2) DTensor ops have an observable effect on the generator. The downside is that users are now responsible for seeding their generator before using DTensor, ensuring all ranks use the same seed. Fixes #159991 confirmed docs rendered OK <img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482 Approved by: https://github.com/wanchaol	2025-08-25 04:21:19 +00:00
Natalia Gimelshein	726dce3c94	[nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks (#161238 ) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/161238 Approved by: https://github.com/kwen2501, https://github.com/syed-ahmed	2025-08-25 03:09:32 +00:00
Chuanhao Zhuge	74280d0913	[muon] Introduce Muon optimizer to PyTorch (#160213 ) A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy. This implementation maintains a minimalist API and is consistent with other optimizer conventions. The PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory. Usage ```python model = MyModelForCausalLM # filter out your params manually muon_params = [...] adamw_params = [...] muon = Muon( params=muon_params, lr=lr, weight_decay=wd, ) adamw = AdamW( params=adamw_params, lr=lr, weight_decay=wd, ) # in training loop loss = model(input) loss.backward() muon.step() adamw.step() muon.zero_grad() adamw.zero_grad() ``` ~~Additional usage~~ ~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~ ```python ~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~ ~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~ ``` As discussed with the team and in the comments, we prefer to make the interface simpler and cleaner, so we removed the callback interface and canonicalized the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs. By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). 
We use the LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts the learning rate from AdamW. Testing ~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~ As discussed, in order not to complicate the codebase, we prefer not to include a reference implementation in PyTorch. We also updated the interface so we don't need to test the FQN-based filtering. Muon is covered by the existing `test_optim.py` unit test. 2. End-to-end test: we added a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch and compared the resulting loss curve against the Moonshot implementation to confirm behavioral consistency. <img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" /> Numerics We evaluate our implementation against an existing implementation to confirm numerical consistency. As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `AdamW`, making Muon a drop-in swap. As expected, the numerical difference comes mainly from `adjust_lr`, with a maximum relative difference of ~5% in the example unit test setup below. 
```python
import copy

import torch
from torch.nn import Linear, MSELoss

# Muon is the optimizer added by this PR; KellySingleDeviceMuon is the
# existing implementation it is compared against (not part of PyTorch).

# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()

lr = 1e-3
wd = 0.1
momentum = 0.95

opt_ref_muon = KellySingleDeviceMuon(
    params=model0.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
)
opt_exp_muon = Muon(
    params=model1.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
)

# one optimizer step with the existing implementation
out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()

# one optimizer step with the new implementation
out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()

for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
As explained above, including this `adjust_lr` is preferable. This is validated by e2e training runs of a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without. <img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" /> Performance Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP: - adamw_ddp finishes in 13.12 min - pytorch_muon_ddp finishes in 13.45 min Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is 2.5% slower than AdamW. AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms <img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" /> Muon: Optimizer.step() takes ~54 ms, step time ~960 ms <img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" /> Note We restrict the implementation to accept only 2D parameters. 
An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful. Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound. Next Steps 1. Add `MuP` 2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices. 3. Open-source unsharded Muon co-designed with FSDP2. **** Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213 Approved by: https://github.com/janeyx99	2025-08-24 08:03:04 +00:00
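The Newton-Schulz orthogonalization step described in the Muon commit above can be sketched in a few lines. This is a minimal pure-Python illustration on list-of-lists matrices, not the code added by the PR: the function name `newton_schulz_orthogonalize` and the helpers `matmul` and `transpose` are hypothetical, the coefficients are the ones from Keller Jordan's blog post, and the parameter names `ns_steps`, `coefficients`, and `eps` simply mirror the config names the commit message mentions. The learning rate adjustment is not shown.

```python
import math

# Quintic Newton-Schulz coefficients (a, b, c) from Keller Jordan's Muon post.
NS_COEFFICIENTS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(grad, ns_steps=5, coefficients=NS_COEFFICIENTS, eps=1e-7):
    """Approximately orthogonalize a 2D matrix (hypothetical helper, illustration only)."""
    if not grad or any(len(row) != len(grad[0]) for row in grad):
        raise ValueError("expected a non-empty 2D matrix")
    a, b, c = coefficients
    # Work in the wide orientation so that X X^T is the smaller Gram matrix.
    transposed = len(grad) > len(grad[0])
    x = transpose(grad) if transposed else [list(row) for row in grad]
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / (norm + eps) for v in row] for row in x]
    for _ in range(ns_steps):
        gram = matmul(x, transpose(x))  # A = X X^T
        gram_sq = matmul(gram, gram)    # A^2
        poly = [[b * g + c * g2 for g, g2 in zip(r1, r2)]
                for r1, r2 in zip(gram, gram_sq)]  # B = b*A + c*A^2
        update = matmul(poly, x)        # B X
        x = [[a * v + u for v, u in zip(r1, r2)]
             for r1, r2 in zip(x, update)]  # X <- a*X + B X
    return transpose(x) if transposed else x
```

With these coefficients the iteration does not converge to exactly orthogonal output; it pushes the singular values into a band around 1, which is the trade-off the blog post describes for a fixed, small number of steps.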
Ting Lu	1de4540449	Use -compress-mode=size for CUDA 13 build for binary size reduction (#161316 ) https://github.com/pytorch/pytorch/issues/159779 CUDA 13 added the support for --compress-mode flag for nvcc across all drivers of CUDA 13.X toolkits, enabling the possibility to use --compress-mode=size for significant size reduction (~71% less for CUDA Math APIs for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/ Why we have to add for CUDA 13 only, quote from @ptrblck : Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4. https://github.com/pytorch/pytorch/pull/157791#issuecomment-3058027353 Default for CUDA 13 will be --compress-mode=balance which gives smaller binaries than LZ4 speed mode used in previous CUDA versions. Related - https://github.com/pytorch/pytorch/pull/157791 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161316 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2025-08-24 03:28:29 +00:00
Aidyn-A	3e5b021f21	[ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357 ) This pull request adds the following ops for sparse matrices using the Eigen library:
```python
add(a_csr, b_csr)
add(a_csc, b_csc)
addmm(c_csr, a_csr, b_csr)
addmm(c_csr, a_csr, b_csc)
addmm(c_csr, a_csc, b_csc)
addmm(c_csr, a_csc, b_csr)
addmm(c_csc, a_csr, b_csr)
addmm(c_csc, a_csr, b_csc)
addmm(c_csc, a_csc, b_csc)
addmm(c_csc, a_csc, b_csr)
```
Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL is not available on `aarch64`, these ops are unavailable on machines with ARM-based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops. This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357 Approved by: https://github.com/pearu, https://github.com/eqy Co-authored-by: Eli Uriegas <eliuriegas@meta.com>	2025-08-23 19:03:55 +00:00
Nikita Shulga	4acdbb8311	[MPS] Fix index_copy for strided indices (#161333 ) By passing strides to strided variant of the tensor Fixes https://github.com/pytorch/pytorch/issues/160993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161333 Approved by: https://github.com/huydhn, https://github.com/wdvr ghstack dependencies: #161206, #161267	2025-08-23 14:38:57 +00:00
PyTorch MergeBot	f912c93344	Revert "Move non inductor workflows to Python 3.9 -> 3.10 (#161182 )" This reverts commit e20f6d798606f3245686e950c43635bbe526232d. Reverted https://github.com/pytorch/pytorch/pull/161182 on behalf of https://github.com/zou3519 due to broke dynamo_wrapped tests, those are a bit finicky to fix (there is probably more than one failure!) ([comment](https://github.com/pytorch/pytorch/pull/161182#issuecomment-3216953097))	2025-08-23 13:00:42 +00:00
Paul de Supinski	33346b5814	Support NUMA Binding for Callable Entrypoints, Take 2 (#161183 ) # Context In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges. Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set. # This PR On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836) Specifically, for each local rank `i`: 1. The parent process sets its own CPU affinity to what local rank `i`'s should be. 2. Then, the parent spawns the subprocess for local rank `i`. 3. Finally, the parent resets its own CPU affinity to what it was originally. There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for both `str` and `Callable`, and it's pretty simple. This required a bit of refactoring: 1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`. 2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.). 3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process. 4. 
Use the context manager for both `str` and `Callable` paths # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183 Approved by: https://github.com/d4l3k	2025-08-23 07:23:22 +00:00
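The set-affinity, spawn, restore sequence from the NUMA binding commit above can be sketched as a small context manager. This is an illustration under assumptions, not the code in the PR: the names `temporary_cpu_affinity` and `spawn_bound` are hypothetical, and the sketch relies on the Linux behaviour that a child process inherits its parent's CPU affinity mask. On platforms without `os.sched_setaffinity` it falls back to a no-op.

```python
import contextlib
import os
import subprocess
import sys


@contextlib.contextmanager
def temporary_cpu_affinity(cpus):
    """Bind the calling process to `cpus`, then restore the previous affinity."""
    if not hasattr(os, "sched_setaffinity"):
        # No scheduler affinity API on this platform: leave the process unbound.
        yield
        return
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        # Restore the parent's affinity even if spawning raised.
        os.sched_setaffinity(0, original)


def spawn_bound(command, cpus):
    """Spawn `command` while the parent is bound to `cpus`; the child inherits the mask."""
    with temporary_cpu_affinity(cpus):
        return subprocess.Popen(command)
```

Because the binding is applied in the parent rather than by wrapping the command line, the same helper would work whether the child is started from a command string or from a Python callable.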
Chong Gu	431846a632	[AMD] Fix AMD User Defined Kernel Autotune (#160671 ) Summary: AMD specific kwargs need to be removed from the guard, otherwise a keyerror will be raised when executing the kernel. Test Plan: ``` buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR ``` can succeed after this change. Rollback Plan: Differential Revision: D80285441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671 Approved by: https://github.com/muchulee8	2025-08-23 07:23:09 +00:00
Malay Bag	cd31be28ec	Reland D80238201: [Torch.Export] Add flat arg paths in error message (#160919 ) Summary: [The diff was reverted due to CLA error, in the process of retrieving account] Previous error message ``` RuntimeError: Expected input at args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC) ``` New error message ``` RuntimeError: Expected input at args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC) ``` Test Plan: ``` buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test ``` https://www.internalfb.com/intern/testinfra/testrun/4785074906254375 ``` buck run mode/opt caffe2/test:test_export -- -r unflatten ``` ``` Ran 413 tests in 208.414s OK (skipped=1, expected failures=13) ``` Rollback Plan: Differential Revision: D80487367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160919 Approved by: https://github.com/angelayi	2025-08-23 07:20:58 +00:00
PyTorch MergeBot	710514a2a5	Revert "Enable output padding when only outermost dim is dynamic (#159404 )" This reverts commit f15ada5c6fad97a7dcbfa4673f067b6942dda640. Reverted https://github.com/pytorch/pytorch/pull/159404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159404#issuecomment-3216517032))	2025-08-23 07:17:30 +00:00
Xu Han	22df59efc0	[inductor] add MSVC language pack check. (#161298 ) Check MSVC's language pack: https://github.com/pytorch/pytorch/issues/157673#issuecomment-3051682766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161298 Approved by: https://github.com/angelayi	2025-08-23 07:06:48 +00:00
Angel Li	3a4140bf8e	[FlexAttention] fixing learnable bias assertion error in inductor (#161170 ) Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (#157677) We traced the root cause to the registration of subgraph buffers—this caused inconsistencies in the naming and ultimately incorrect retrieval later on. This problem only arose if the model was compiled as a whole (ie using @torch.compile) since only then would there be naming conflicts. In this PR, we register the buffers with the base graph to solve this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161170 Approved by: https://github.com/drisspg	2025-08-23 06:24:22 +00:00
Yang Wang	6443ea337d	enable more tests (#161192 ) Enable more vllm test against pytorch main, add schedule to run the test every 12 hours. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161192 Approved by: https://github.com/huydhn	2025-08-23 06:01:22 +00:00
Justin Chu	36ac916929	[ONNX] Fix lower opset version support in dynamo=True (#161056 ) After we switched to constructing the registry with the specified opset version in dynamo=True, support for opset<18 was broken because there would be no torchlib ops registered for these opsets. I updated the registry creation logic to always use opset 18 if the requested opset is lower, and use the version converter (as designed) to target those opsets. This requires onnxscript>=0.4 (https://github.com/pytorch/pytorch/pull/161312) Fixes https://github.com/onnx/onnx/issues/7235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161056 Approved by: https://github.com/titaiwangms	2025-08-23 05:04:36 +00:00
PyTorch UpdateBot	7131bfab89	[vllm hash update] update the pinned vllm hash (#161227 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161227 Approved by: https://github.com/pytorchbot	2025-08-23 04:25:16 +00:00
PyTorch UpdateBot	ac8d9418ae	[audio hash update] update the pinned audio hash (#161331 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161331 Approved by: https://github.com/pytorchbot	2025-08-23 04:21:03 +00:00
Justin Chu	38a492d40d	[ONNX] Remove unused _onnx_supported_ops (#161322 ) Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161322 Approved by: https://github.com/titaiwangms	2025-08-23 02:42:25 +00:00
Kurt Mohler	394728bab2	[MPS] Update `avg_pool3d` kernel to use `opmath_t` (#161071 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161071 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #161011	2025-08-23 02:36:22 +00:00
Kurt Mohler	121afd6a8f	[MPS] Update `avg_pool2d` to use Metal kernel when `ceil_mode=True` (#161011 ) Fixes #160743 The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (`0ee6e58643/avg_pool_mps/perf_2d.py`) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match. I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011 Approved by: https://github.com/malfet	2025-08-23 02:36:22 +00:00
Blaine Burton Rister	d228a776e9	[Inductor-FX] Support Tensorbox outputs (#161245 ) # Problem The FX converter previously supported graph outputs which were `StorageBox`, but not `TensorBox`. The latter seems to show up in certain cases when the output is a slice/view of the input. # Fix This PR generalizes the code to handle `MutableBox` instead of `StorageBox` specifically. # Test Added a CI test exposing the issue. The test case was found by intentionally breaking `TensorBox(ReinterpretView` support in https://github.com/pytorch/pytorch/pull/161258. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161245 Approved by: https://github.com/angelayi	2025-08-23 02:04:13 +00:00
can-gaa-hou	cee72119b2	[Test] Adding a testcase for constant_pad_nd (#161259 ) Fixes #161066 This PR adds a simple testcase for constant_pad_nd on MPS as mentioned in https://github.com/pytorch/pytorch/pull/161149#issuecomment-3211701274 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161259 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-23 01:00:50 +00:00
PyTorch MergeBot	47d267364c	Revert "[SymmMem] Support rendezvous on slice of a tensor (#160825 )" This reverts commit 9d9cc9897ac44a1a8df38211b03d8342a8af48c3. Reverted https://github.com/pytorch/pytorch/pull/160825 on behalf of https://github.com/kwen2501 due to Change of course; use storage_ptr as key ([comment](https://github.com/pytorch/pytorch/pull/160825#issuecomment-3215951048))	2025-08-22 23:41:55 +00:00
Justin Chu	0d9da384ef	Bump onnxscript to 0.4.0 in CI (#161312 ) Use onnxscript apis for torch 2.9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161312 Approved by: https://github.com/titaiwangms, https://github.com/malfet	2025-08-22 23:23:08 +00:00
Aaron Pollack	f521e82a4e	Update pyrefly config for better codenav (#161200 ) This fixes behavior in codenav by switching from `replace_imports_with_any` to `ignore-missing-imports` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161200 Approved by: https://github.com/aorenste, https://github.com/albanD	2025-08-22 23:05:07 +00:00
Ivan Zaitsev	bcfe1b2d71	Add initial bc-linter configuration (#161319 ) Preparation for https://github.com/pytorch/test-infra/pull/7016 Currently merging this PR is a noop change for PyTorch repo (bc-linter is not looking at the config yet). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161319 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi	2025-08-22 22:54:25 +00:00
Justin Chu	419a2dbf5f	[ONNX] Remove enable_fake_mode and exporter_legacy (#161222 ) Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222 Approved by: https://github.com/titaiwangms	2025-08-22 22:15:27 +00:00
Shivam Raikundalia	3373b074f5	[Profiler] Add GC Events to Python Stack Tracer (#161209 ) Summary: Adds Python Garbage Collection to Kineto Traces and Profiler FunctionEvents. Create custom cpp callback in profiler_python.cpp. Then define a python function with cpp and register that callback for all python garbage collection. We don't worry about thread safety in this case because we are only doing init/teardown for main thread while holding GIL. Currently we are hiding this behind experimental config because python tracing tends to be unstable especially when adding any new feature. If this is found to not add too much overhead we can set this to on by default. NOTE: To enable this you need both with_stack=True and the experimental config on! Test Plan: Ran trace with GC induced and saw it on trace Also added a test Rollback Plan: Differential Revision: D80491146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209 Approved by: https://github.com/ngimel	2025-08-22 22:11:25 +00:00
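As a rough illustration of the hook the profiler commit above builds on, the standard library exposes `gc.callbacks`, a list of callables that the collector invokes with a phase (`"start"` or `"stop"`) and an info dict. The sketch below records one event per collection in pure Python. The class name `GcEventRecorder` and the event fields are hypothetical; the PR itself registers a callback defined in C++ and emits events into Kineto traces, which is not reproduced here.

```python
import gc
import time


class GcEventRecorder:
    """Record one event per garbage collection while the context is active."""

    def __init__(self):
        self.events = []
        self._start = None

    def _callback(self, phase, info):
        # The collector calls this with phase "start" before and "stop" after a run.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.events.append(
                {
                    "generation": info["generation"],
                    "collected": info["collected"],
                    "duration_s": time.perf_counter() - self._start,
                }
            )
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self._callback)
        return self

    def __exit__(self, *exc):
        gc.callbacks.remove(self._callback)
        return False
```

A typical use would wrap the region being profiled in `with GcEventRecorder() as rec:` and inspect `rec.events` afterwards.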
Nikita Shulga	c8bb0e4720	[MPS] Fix `index_copy` for scalars (#161267 ) By `squeezing the input` when copying into scalar tensor from a 1d one And enable `test_index_copy_scalars_mps` Fixes https://github.com/pytorch/pytorch/issues/160737 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161267 Approved by: https://github.com/manuelcandales, https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #161206	2025-08-22 21:45:34 +00:00
Rob Timpe	4c36c8a994	[dynamo] Support method calls on complex ConstantVariables (#161122 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161122 Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas	2025-08-22 21:40:03 +00:00
Yiming Zhou	9d882fd9ff	[benchmark] Add torchscript jit.trace to benchmark option (#161223 ) To compare NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter:
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223 Approved by: https://github.com/huydhn	2025-08-22 21:38:28 +00:00
Eddie Yan	2835cc5e91	[cuDNN] head dim > 128 works on H100 again in cuDNN SDPA? (#161210 ) reference: https://github.com/pytorch/torchtitan/pull/1610 9.10 only for now, we would want to hold off on upgrading to either cuDNN frontend 1.14+/cuDNN 9.11+ due to some head-dim > 128 handling issues Pull Request resolved: https://github.com/pytorch/pytorch/pull/161210 Approved by: https://github.com/Skylion007	2025-08-22 21:21:53 +00:00
PyTorch MergeBot	3f1a97a99c	Revert "[dynamic shapes] unbacked-safe slicing (#157944 )" This reverts commit 44549c7146bd6c4166f97e856037babe1b7f4f49. Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/pianpwk due to this PR & internal diff landed out of sync, just reverted internal with D80720654, will revert this & reland as codev ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3215610135))	2025-08-22 20:48:46 +00:00
PyTorch MergeBot	981ac533c6	Revert "Close some sources of fake tensor leakages (#159923 )" This reverts commit 5afa4187dfe1e99278f8e372ec09102d5b937572. Reverted https://github.com/pytorch/pytorch/pull/159923 on behalf of https://github.com/zou3519 due to broke aoti test in inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/159923#issuecomment-3215580688))	2025-08-22 20:42:50 +00:00
Gabriel Ferns	3ea6cc8c2d	Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387 ) Conv exhaustive currently throws an error, and I think it's worth adding tests to the other ops too in order to prevent regressions in exhaustive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387 Approved by: https://github.com/coconutruben	2025-08-22 20:06:09 +00:00
PyTorch MergeBot	2c0650a00a	Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711 )" This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e. Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200))	2025-08-22 19:10:35 +00:00
PyTorch MergeBot	eba1ad09e4	Revert "[SymmMem] Support rendezvous on view of a tensor (#160925 )" This reverts commit 9d7cecdd6c44c5421d341bcc359be4097ea9a2f5. Reverted https://github.com/pytorch/pytorch/pull/160925 on behalf of https://github.com/kwen2501 due to Change of course: use storage ptr as symm mem keys as in the old days and force no_split in MemPool ([comment](https://github.com/pytorch/pytorch/pull/160925#issuecomment-3215315717))	2025-08-22 18:59:25 +00:00
Wang, Chuanqi	a43480d19c	[CD] Enable triton xpu Windows build for Python 3.14 (#161255 ) Follow #159869 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161255 Approved by: https://github.com/atalman	2025-08-22 18:39:31 +00:00
Xu Han	17b0263e86	[inductor] fix march=native pass to Windows CC. (#161264 ) fix march=native pass to Windows CC. <img width="593" height="218" alt="image" src="https://github.com/user-attachments/assets/1caedffa-d9be-43d9-9ce2-590c055980cd" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161264 Approved by: https://github.com/angelayi	2025-08-22 18:38:51 +00:00
Xu Han	97200c9711	[inductor] Add get page_size support for Windows. (#161273 ) `resource` does not work on Windows, as it is a Unix-specific module (see https://docs.python.org/2/library/resource.html). Use the Windows system API to get page_size instead. Tested locally: <img width="467" height="433" alt="image" src="https://github.com/user-attachments/assets/47a39060-3aea-46c3-bd8e-35a39413c51f" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161273 Approved by: https://github.com/angelayi	2025-08-22 18:36:14 +00:00
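One way to query the page size on both platforms is sketched below. It is an assumption about how this could be done, not the code from the PR: the function name `get_page_size` is hypothetical, the Windows branch calls `GetSystemInfo` from kernel32 through `ctypes`, and other platforms keep using `resource.getpagesize()`.

```python
import sys


def get_page_size():
    """Return the OS memory page size in bytes (hypothetical helper)."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        class SystemInfo(ctypes.Structure):
            # Field layout of the Win32 SYSTEM_INFO structure.
            _fields_ = [
                ("wProcessorArchitecture", wintypes.WORD),
                ("wReserved", wintypes.WORD),
                ("dwPageSize", wintypes.DWORD),
                ("lpMinimumApplicationAddress", ctypes.c_void_p),
                ("lpMaximumApplicationAddress", ctypes.c_void_p),
                ("dwActiveProcessorMask", ctypes.c_size_t),
                ("dwNumberOfProcessors", wintypes.DWORD),
                ("dwProcessorType", wintypes.DWORD),
                ("dwAllocationGranularity", wintypes.DWORD),
                ("wProcessorLevel", wintypes.WORD),
                ("wProcessorRevision", wintypes.WORD),
            ]

        info = SystemInfo()
        ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
        return int(info.dwPageSize)

    # `resource` is only importable on Unix-like systems.
    import resource

    return resource.getpagesize()
```

The standard library's `mmap.PAGESIZE` is another cross-platform source for the same value and may be simpler when no other system information is needed.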
PyTorch MergeBot	1d458e2947	Revert "[Inductor] Update Outer Reduction Heuristic (#159093 )" This reverts commit f085f299584b06a2a7d8855eda2a411313e782ad. Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/seemethere due to this fails internal tests, see D80630416 for more info ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3215263317))	2025-08-22 18:35:36 +00:00
Yidi Wu	266784ec6a	remove old while_loop_schema_gen test (#161202 ) Fixes https://github.com/pytorch/pytorch/issues/141202. This test is flaky for mysterious reasons and we have created a new way of creating schemas for hops. So delete the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161202 Approved by: https://github.com/zou3519	2025-08-22 18:22:29 +00:00
Jeff Daily	25df65afd8	[ROCm] revamp HIPCachingAllocatorMasqueradingAsCUDA (#161221 ) HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes usability of hipify replacing c10::cuda::CUDACachingAllocator::get() where callers expect a CUDAAllocator to be returned but instead were getting a very thin Allocator shim instead. This also fixes using cudagraph trees with torch compile. The hip:0 device was not being replaced by the cuda:0 device in all methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-22 18:13:12 +00:00
atalman	e20f6d7986	Move non inductor workflows to Python 3.9 -> 3.10 (#161182 ) Related to: https://github.com/pytorch/pytorch/issues/161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182 Approved by: https://github.com/malfet, https://github.com/huydhn	2025-08-22 16:48:43 +00:00
Nikita Shulga	c2390087c3	[MPS] Fix index_select for scalar_types (#161206 ) By copy-n-pasting logic from `index_select_out_cpu` (and `_cuda`), where essentially the resizing is done inside the op, which also fixes faulty logic for scalars Pull Request resolved: https://github.com/pytorch/pytorch/pull/161206 Approved by: https://github.com/manuelcandales	2025-08-22 16:45:35 +00:00
zeshengzong	f09458c2e1	Enable `test/test_numpy_interop.py` config in mypy (#158556 ) ## Test Result ```bash lintrunner --take MYPY test/test_numpy_interop.py Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file. ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556 Approved by: https://github.com/soulitzer	2025-08-22 16:18:58 +00:00
Jithun Nair	7fcdd8d6af	Use ROCm MI325 runners for trunk.yml (#161184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161184 Approved by: https://github.com/jeffdaily	2025-08-22 16:18:55 +00:00
PyTorch MergeBot	c7a77470c5	Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482 )" This reverts commit d1faf2ef0476eb60b42c057baee9af0f48ae849a. Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))	2025-08-22 15:04:28 +00:00
Rex Zhang	ce467df5d1	rm platform args xplat/langtech/mobile/BUCK (#161018 ) Differential Revision: D80460691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161018 Approved by: https://github.com/drisspg	2025-08-22 14:47:36 +00:00
IvanKobzarev	db44de4c0d	[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 ) 1. Applying @eellison idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory: ``` """ Alternative version of estimate_peak_memory, that respects the fact, that every SchedulerNode has multiple phases: 1. alloc ( outputs ) 2. run_kernel 3. dealloc last_use buffers estimate_peak_memory collapses memory into one value: size_alloc - size_free While peak memory happens after alloc. Duplicating the code to not migrate all callsites at once, In future usages of estimate_peak_memory will migrate to this version. """ ``` - Applying this in `reorder_communication_preserving_peak_memory` pass. 2. Buffers during reordering can change deallocation point, if candidate and group to swap both are users of the f_input_buf and group contains last_use_snode. - Addressing this tracking the last_use_snode for each buffer and recomputing current memory respecting the change in size_free (group_node after reordering is not the last user of the buffer and its size_free -= buffer_size, while candidate becomes the last user and candidate.size_free += buffer_size). 4. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit number of collectives to reorder. What is after this PR: Iterative recomputation of memory estimations matches full memory estimations. Active memory is not regressing a lot, but reserved memory is significantly regressed. Investigation and fix of "reserved" memory will be in following PRs. 
BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb ``` [rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58% [rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05% [rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55% [rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44% [rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35% ``` REORDER: active: 32Gb reserved: 36Gb ``` [rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50% [rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77% [rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31% [rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64% [rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 
active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55% ``` REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb ``` [rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01% [rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02% [rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13% [rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21% [rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90% ``` Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113 Approved by: https://github.com/wconstab, https://github.com/eellison Co-authored-by: eellison <elias.ellison@gmail.com>	2025-08-22 14:19:57 +00:00
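The difference between the collapsed estimate and the alloc/free-aware estimate described in the commit above can be shown with a toy model. This is a sketch under assumptions: `NodeMemory`, `estimate_peak_collapsed`, and `estimate_peak_allocfree` are hypothetical names, and each node is reduced to two integers, whereas the real pass works on Inductor scheduler nodes and buffers.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    size_alloc: int  # bytes allocated for the node's outputs before it runs
    size_free: int   # bytes of last-use buffers released after it runs


def estimate_peak_collapsed(nodes):
    """Treat each node as a single net change of size_alloc - size_free."""
    current = peak = 0
    for node in nodes:
        current += node.size_alloc - node.size_free
        peak = max(peak, current)
    return peak


def estimate_peak_allocfree(nodes):
    """Sample the peak after allocation and before last-use buffers are freed."""
    current = peak = 0
    for node in nodes:
        current += node.size_alloc  # phase 1: allocate outputs
        peak = max(peak, current)   # phase 2: the kernel runs at this level
        current -= node.size_free   # phase 3: deallocate last-use buffers
    return peak
```

For the sequence `NodeMemory(10, 0), NodeMemory(5, 10), NodeMemory(3, 5)` the collapsed estimate reports a peak of 10, while the phase-aware estimate reports 15, because the second node's output is live at the same time as the 10-byte buffer it is about to release.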
PyTorch MergeBot	639b8cc51d	Revert "cd: Add no-cache for test binaries (#149218 )" This reverts commit 523bffd38856dc9fca36bddded64f74822a6e1a2. Reverted https://github.com/pytorch/pytorch/pull/149218 on behalf of https://github.com/atalman due to Lets not use no-cache flags on test binaries ([comment](https://github.com/pytorch/pytorch/pull/149218#issuecomment-3214338844))	2025-08-22 13:14:23 +00:00
Ting Lu	49ff884b1e	Add CUDA 13.0 x86 builds (#160956 ) https://github.com/pytorch/pytorch/issues/159779 CUDA 13.0.0 NVSHMEM 3.3.20 CUDNN 9.12.0.46 Adding x86 linux builds for CUDA 13. Adding libtorch docker. Package naming changed for CUDA 13 (removed postfix -cu13 for some packages). Preparation checklist: 1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages 2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata Pull Request resolved: https://github.com/pytorch/pytorch/pull/160956 Approved by: https://github.com/atalman Co-authored-by: atalman <atalman@fb.com>	2025-08-22 11:31:09 +00:00
Ting Lu	a68f63e331	Add Windows CUDA 13 build and magma script (#161073 ) Add magma build 13.0 for Windows Add cuda_install.bat 13.0 for Windows build https://github.com/pytorch/pytorch/issues/159779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161073 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2025-08-22 11:24:25 +00:00
Tom Ritchford	774b4befa1	[BE] [dynamo] Simplify two methods in ConstDictVariable (#159361 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159361 Approved by: https://github.com/anijain2305	2025-08-22 11:11:30 +00:00
FFFrog	2beffb3311	Refactoring TensorImpl by using constexpr and std::is_same_v (#161043 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161043 Approved by: https://github.com/Skylion007	2025-08-22 10:49:49 +00:00
frost-intel	9b4adc4db7	[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 ) Adds support for FlightRecorder in ProcessGroupXCCL. See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568 Approved by: https://github.com/guangyey, https://github.com/fduwjj	2025-08-22 09:03:35 +00:00
Arsh Zahed	9e491f753e	[dynamo] Remove extra if statement in builder _wrap (#161215 ) Removes a redundant if statement. Does not impact logic so no test changes needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161215 Approved by: https://github.com/StrongerXi	2025-08-22 08:56:06 +00:00
Yu, Guangye	373e25c2eb	Disable background threads for XPU host allocator (#161242 ) # Motivation https://github.com/pytorch/pytorch/pull/160505 enables background threads for XPU host allocator. However, it will hang on Windows during program exit. Now disable it until we narrow down the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161242 Approved by: https://github.com/EikanWang	2025-08-22 08:40:13 +00:00
IvanKobzarev	595987d28d	[bucketing] allow convert_element_type after fsdp reduce_scatter (#161159 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161159 Approved by: https://github.com/eellison	2025-08-22 06:41:50 +00:00
Xu Han	c4670e40c9	[inductor] remove Windows unsupported build options. (#161197 ) Changes: 1. Math-related build options are not supported by MSVC, so skip them on Windows. 2. Move all math-related build options into the `_get_ffast_math_flags` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161197 Approved by: https://github.com/jansel	2025-08-22 06:23:43 +00:00
Xu Han	9b3ebd25ac	[inductor] Enable max compatible to msvc for oneAPI headers. (#161196 ) Enable max compatible to msvc for oneAPI headers. The key context is `The /permissive- option is compatible with almost all of the header files from the latest Windows Kits` from https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161196 Approved by: https://github.com/jansel	2025-08-22 06:23:26 +00:00
zeshengzong	f8bd85827d	Optimize `zero_grad` description (#161239 ) Optimize [zero_grad doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) format and description. ## Test Result ### Before <img width="996" height="534" alt="image" src="https://github.com/user-attachments/assets/e1db973c-57e8-4525-90e7-0500cde2263d" /> ### After <img width="890" height="496" alt="image" src="https://github.com/user-attachments/assets/5579c4fb-a857-4030-9303-34770083d1a5" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161239 Approved by: https://github.com/janeyx99	2025-08-22 06:18:25 +00:00
Huy Do	bc7eaa0d8a	[BE] Remove the default TORCH_CUDA_ARCH_LIST in CI Docker image (#161137 ) It doesn't make sense for this to default to Maxwell, which is too old. All other places in CI/CD need to override this value. IMO, it makes more sense to not set it at all and let CI/CD jobs set it for their own use cases instead. This is partly responsible for the build failure in https://github.com/pytorch/pytorch/issues/160988 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161137 Approved by: https://github.com/msaroufim	2025-08-22 06:03:11 +00:00
Yang Wang	0dea191ff7	[VLLM TEST]setup test workflow (#160583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160583 Approved by: https://github.com/huydhn, https://github.com/atalman	2025-08-22 05:38:39 +00:00
Simon Fan	8aad3a60ce	[dynamo] propagate tensor metadata on Tensor.__setitem__(tensor) (#161036 ) Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: `5ee464db5c/torch/_dynamo/variables/misc.py (L671)` Stared at this with @anijain2305 yesterday, `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor. FIXES https://github.com/pytorch/pytorch/issues/160901 It should also be the root cause behind the issue in https://github.com/pytorch/torchtitan/pull/1604 @bdhirsh @ruisizhang123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161036 Approved by: https://github.com/anijain2305 ghstack dependencies: #160805	2025-08-22 04:43:22 +00:00
PyTorch UpdateBot	c7fb031706	[audio hash update] update the pinned audio hash (#161226 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161226 Approved by: https://github.com/pytorchbot	2025-08-22 04:22:08 +00:00
Yiming Zhou	c60dea5261	[export] Allow tempfile._TemporaryFileWrapper in package_pt2 (#161203 ) Summary: We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py` However, it is not recognized as an allowed file format and a warning will be thrown. Test Plan: CI Rollback Plan: Differential Revision: D80740916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203 Approved by: https://github.com/angelayi	2025-08-22 04:10:35 +00:00
Phoslight	bf8431ba06	[inductor][cpu] Fix double-offset issue in `GEMM_TEMPLATE` (#159233 ) Fixes #158076 Basically, the gemm template generates code like ``` cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>( &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]), &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]), &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]), static_cast<int64_t>(m_end + ((-1LL)*m_start)), static_cast<int64_t>(Nr), static_cast<int64_t>(k_end + ((-1LL)*k_start)), static_cast<int64_t>(196LL), static_cast<int64_t>(80LL), static_cast<int64_t>(Nc_blocks*Nr) ); ``` However, when the input tensor W has a storage offset, this results in a double-offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access. The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly. I think a good way to fix this issue is to create a new matrix that has no storage offset. When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue. BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated Python module, and they are correct. The newly added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076. I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233 Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok	2025-08-22 03:47:28 +00:00
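The double-offset failure described in #159233 can be illustrated with a small pure-Python sketch. The flat list, `storage_offset`, and the helper names are hypothetical stand-ins chosen for illustration; this is not the inductor template code, only a model of the indexing mistake and of the "copy into offset-free storage" idea.

```python
# Hypothetical model: a "tensor" is a view into flat storage that
# starts at storage_offset.
storage = list(range(10))
storage_offset = 4  # the view W starts at storage[4]


def read_correct(i):
    # Apply the storage offset exactly once.
    return storage[storage_offset + i]


def read_double_offset(i):
    # Bug pattern: the base pointer already includes the offset,
    # and the index expression adds it a second time.
    base = storage_offset
    return storage[base + storage_offset + i]


def materialize_without_offset():
    # Fix in the spirit of the commit: copy the view into fresh storage
    # whose offset is zero, so later indexing cannot double-count it.
    return storage[storage_offset:]
```

With these values, `read_double_offset(2)` indexes past the end of the storage, which mirrors the out-of-bounds access reported in the commit message.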
Jovian Anthony Jaison	2fdd4f918c	Log exception_stack_trace to dynamo_compile (#161096 ) Note: Adding unit test for this is tricky as having errors in the specific unit test would cause test_utils.py to crash all together. Tested as follows: 1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py 2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n x = 1/0\n ~^~\nZeroDivisionError: division by zero\n'] Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096 Approved by: https://github.com/c00w	2025-08-22 03:29:15 +00:00
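The manual check described in #161096 (forcing a `1/0` and printing the captured trace) can be reproduced outside of dynamo with the standard library. This is a minimal sketch; `capture_exception_stack_trace` is a hypothetical helper name, not a function from `convert_frame.py`.

```python
import traceback


def capture_exception_stack_trace(fn):
    """Run fn and return a list holding the formatted traceback, or [] on success."""
    try:
        fn()
    except Exception:
        # format_exc() returns the traceback of the exception being handled
        # as one string, giving a one-element list like the one printed above.
        return [traceback.format_exc()]
    return []


trace = capture_exception_stack_trace(lambda: 1 / 0)
```

The single string in `trace` begins with `Traceback (most recent call last):` and ends with the `ZeroDivisionError` line, the same shape as the output quoted in the commit message.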
Scott Todd	31a41daff4	[ROCm][Windows] Include native_transformers srcs to fix link errors. (#160373 ) Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like ``` [7019/7028] Linking CXX shared library bin\torch_hip.dll FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ." 
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output: lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64) >>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64)) >>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64)) ``` The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available. I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373 Approved by: https://github.com/alugorey, https://github.com/jeffdaily	2025-08-22 01:43:25 +00:00
Jane Xu	cc791d5857	Quick fix to headers in stable/tensor_inl.h (#161168 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161168 Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007	2025-08-22 01:27:44 +00:00
Yiming Zhou	be2e6b3158	[export] Remove unused Model, tensor_paths, constant_paths (#161185 ) Summary: Removed `Model`, it's not being used anywhere so it's safe. Removed `tensor_paths` and `constant_paths` fields in `ExportedProgram` - BC: when the current deserializer load a previously serialized EP (that comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields - FC: when the old deserializer load a newly serialized EP (that doesn't come with `tensor_paths` and `constant_paths`, it will also ignore those two fields in `_dict_to_dataclass()` Differential Revision: D80725094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185 Approved by: https://github.com/SherlockNoMad	2025-08-22 01:07:01 +00:00
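The BC/FC argument in #161185 relies on the deserializer ignoring fields it does not know about. The sketch below shows that pattern with a plain dataclass; `ProgramRecord` and `dict_to_dataclass_ignoring_unknown` are hypothetical names, and the real `_dict_to_dataclass()` may be implemented differently.

```python
from dataclasses import dataclass, fields


@dataclass
class ProgramRecord:  # hypothetical stand-in for a serialized schema
    name: str
    version: int = 0


def dict_to_dataclass_ignoring_unknown(cls, data):
    # Keep only keys that the current schema declares; anything else
    # (for example a field removed from the schema) is dropped.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})


# A payload written by an older serializer that still emits the removed fields.
old_payload = {"name": "ep", "version": 1, "tensor_paths": {}, "constant_paths": {}}
record = dict_to_dataclass_ignoring_unknown(ProgramRecord, old_payload)
```

A payload without the removed fields loads the same way, which covers the other direction of the compatibility claim.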
eellison	a85711d565	Avoid making node a successor/predecessor of itself (#161205 ) This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161205 Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh	2025-08-22 00:30:29 +00:00
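A minimal sketch of the guard described in #161205, assuming a simple adjacency-set representation. The `successors`/`predecessors` maps and `add_dependency` are hypothetical names, not the memory planner's data structures.

```python
from collections import defaultdict

successors = defaultdict(set)
predecessors = defaultdict(set)


def add_dependency(src, dst):
    # Skip self-edges so a node never becomes its own successor or
    # predecessor, which would make the graph cyclic.
    if src == dst:
        return False
    successors[src].add(dst)
    predecessors[dst].add(src)
    return True
```

Returning a boolean is a design choice for this sketch only; it makes the skipped edge observable to callers and tests.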
dolpm	ff4f5dd8ed	[nativert] oss layout planner tests (#160942 ) Summary: att - changed one of the tests to get rid of torcharrow dep. Test Plan: ``` buck2 test //caffe2/test/cpp/nativert:layout_planner_tests Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Reviewed By: SherlockNoMad Differential Revision: D80108549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160942 Approved by: https://github.com/georgiaphillips, https://github.com/henryoier	2025-08-22 00:26:25 +00:00
Ankita George	46429be723	[DCP][HF] Add option to parallelize reads in HF Storage Reader (#160205 ) Parallelize reading of data behind thread_count argument to HFStorageReader Test plan: ensure existing tests pass and run a job successfully with these changes Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160205 Approved by: https://github.com/meetv18	2025-08-21 23:58:02 +00:00
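The idea of gating parallel reads behind a `thread_count` argument can be sketched with the standard library. `read_all` and `read_one` are hypothetical names for illustration; this is not the `HFStorageReader` implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(paths, read_one, thread_count=1):
    # thread_count <= 1 keeps the sequential behaviour; larger values fan
    # the reads out over a thread pool. Results keep the input order.
    if thread_count <= 1:
        return [read_one(p) for p in paths]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(read_one, paths))
```

Because `ThreadPoolExecutor.map` yields results in input order, callers see the same list regardless of the thread count.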
dependabot[bot]	f5bf5147ad	Bump uv from 0.8.4 to 0.8.6 in /.ci/lumen_cli (#161212 ) Bumps [uv](https://github.com/astral-sh/uv) from 0.8.4 to 0.8.6. - [Release notes](https://github.com/astral-sh/uv/releases) - [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md) - [Commits](https://github.com/astral-sh/uv/compare/0.8.4...0.8.6) --- updated-dependencies: - dependency-name: uv dependency-version: 0.8.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-08-21 15:54:34 -07:00
PyTorch MergeBot	fc0683b1e7	Revert "[ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357 )" This reverts commit ce048de608180fa88335e5821070472539968b54. Reverted https://github.com/pytorch/pytorch/pull/155357 on behalf of https://github.com/seemethere due to This is causing buck builds to fail since we didn't add the definition of AT_USE_EIGEN_SPARSE in the buckbuild.bzl file, will follow-up and re-land this. ([comment](https://github.com/pytorch/pytorch/pull/155357#issuecomment-3212270510))	2025-08-21 22:38:40 +00:00
Nikita Shulga	cb57953215	[BE] Enable `test_index_put_accumulate_duplicate_indices` on MPS (#161201 ) By changing dtype to float if device is MPS Note: for some reason test runs much longer on MPS than on CPU ``` % python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok ---------------------------------------------------------------------- Ran 1 test in 9.139s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161201 Approved by: https://github.com/dcci	2025-08-21 22:05:42 +00:00
PaulZhang12	f085f29958	[Inductor] Update Outer Reduction Heuristic (#159093 ) Update outer reduction heuristics for significant speedups. HuggingFace: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" /> Average ~20% speedup on a kernel by kernel basis TorchBench: <img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" /> Average ~40% speedup on a kernel by kernel basis <img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" /> Differential Revision: [D80630416](https://our.internmc.facebook.com/intern/diff/D80630416) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093 Approved by: https://github.com/jansel	2025-08-21 22:02:49 +00:00
Will Constable	d1faf2ef04	[DTensor] Make default RNG semantics match user-passed generator (#160482 ) Previously, DTensor kept its own copy of the generator state after the first time a random operator was called on a DTensor. This copy would evolve independently from the generator outside of DTensor. After adding support for users to pass a specific generator into random operators (e.g. `uniform_(..., generator=)`), it was determined (in discussion on #159991) to change the semantics so that any random operations performed on DTensor would evolve the state of the publicly visible generators (either the default one or user-passed one). The upsides are (1) it is now possible to call torch.manual_seed() at any point in the program and have a consistent effect on DTensor, (2) DTensor ops have an observable effect on the generator. The downside is that users are now responsible for seeding their generator before using DTensor, ensuring all ranks use the same seed. Fixes #159991 confirmed docs rendered OK <img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482 Approved by: https://github.com/wanchaol	2025-08-21 22:02:16 +00:00
Yang Wang	cc2b65a91a	[VLLM]setup test cli logics (#160361 ) Set up the vLLM test logic: 1. Install wheels generated by the previous build stage. 2. Generate and install the vLLM test package list at run time, based on the torch wheels in the instance. 3. Run tests based on the predefined test plan. Note that the test-plan format is temporary and intended for basic vLLM testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160361 Approved by: https://github.com/atalman, https://github.com/huydhn	2025-08-21 21:59:41 +00:00
Gabriel Ferns	67fc16c744	Add profiler analysis flag to combine multiple profiles into one (#161145 ) Combine multiple profiles into one: ``` python profile_analysis.py --combine <file1> <file2> ... <out> ``` This only works well if they have different pids, like from different programs in a distributed run. <img width="1521" height="465" alt="combining_multiple_profiles" src="https://github.com/user-attachments/assets/aba7112b-e9a9-4075-b82b-a4e4408384da" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161145 Approved by: https://github.com/xmfan	2025-08-21 21:36:58 +00:00
Ankita George	fb241d0a44	[dcp][hf] Fix multi-rank consolidation for no files to process case (#160660 ) Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, some ranks have no work to do. This case wasn't caught when I tested, and there was an extra barrier call, causing issues for the ranks that had no work to do. They should wait at the end, as do the ranks with work. Test Plan: tested this case on a job e2e; added a unit test Rollback Plan: Differential Revision: D80273616 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660 Approved by: https://github.com/sibuachu	2025-08-21 21:18:03 +00:00
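The "more ranks than files" case from #160660 can be modelled without a process group. In this sketch the round-robin assignment and both function names are assumptions made for illustration; the actual distribution of files to ranks in the consolidation code may differ.

```python
def assign_files(files, world_size):
    # Round-robin assignment; ranks beyond len(files) get an empty list.
    return {rank: files[rank::world_size] for rank in range(world_size)}


def barrier_calls_per_rank(files, world_size):
    # Every rank, with or without work, reaches the single final barrier once.
    # If only some ranks took an extra barrier, the counts would differ and
    # the ranks would end up waiting on each other.
    calls = {}
    for rank, mine in assign_files(files, world_size).items():
        for _ in mine:
            pass  # consolidation work would happen here
        calls[rank] = 1  # the one barrier at the end
    return calls
```

With two files and four ranks, ranks 2 and 3 receive empty lists but still record exactly one barrier call, matching the "wait at the end" behaviour the fix describes.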
Jagadish Krishnamoorthy	d2b8c0d431	forward fix of #152198 (#161166 ) `torch._inductor.virtualized.OpsValue` instances do not have a `shape` attribute. This breaks the fp8 test on ROCm. Add the OpsValue class to the todo list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161166 Approved by: https://github.com/jeffdaily	2025-08-21 21:09:48 +00:00
can-gaa-hou	e25ee0290e	Fix constant_pad_nd_mps bug when pad is empty (#161149 ) Fixes #161066 There is a size check here, which causes the error. `8ce81bcee1/aten/src/ATen/native/mps/operations/Pad.mm (L39-L40)` If the argument `pad` is empty, it will return the cloned tensor on CPU. `8ce81bcee1/aten/src/ATen/native/PadNd.cpp (L43-L64)` Therefore, this PR fixes the empty padding argument error by checking the size first and returning a cloned tensor immediately if the padding size is 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161149 Approved by: https://github.com/malfet	2025-08-21 20:45:26 +00:00
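The early return described in #161149 can be sketched in one dimension with plain lists. `constant_pad_1d` is a hypothetical helper, not the MPS kernel; it only shows the ordering of "check for empty pad first, then validate sizes".

```python
def constant_pad_1d(values, pad, fill=0):
    # Early return mirroring the fix: an empty pad means "no padding",
    # so hand back a copy before any size checks run.
    if len(pad) == 0:
        return list(values)
    if len(pad) != 2:
        raise ValueError("pad must be empty or (left, right) in this 1-D sketch")
    left, right = pad
    return [fill] * left + list(values) + [fill] * right
```

Returning a copy rather than the input itself corresponds to the cloned tensor that the CPU path returns for an empty `pad`.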
Animesh Jain	5805c4210b	[invoke_subgraph][inductor] Thread graphsafe rng input states for hops (#160713 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160713 Approved by: https://github.com/eellison	2025-08-21 20:41:29 +00:00
Xu Han	db38c44ad6	[inductor] add libraries_dirs for level_zero (#161146 ) Changes: 1. Append to `include_dirs` instead of setting it. 2. Append `libraries_dirs` for level_zero. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161146 Approved by: https://github.com/angelayi	2025-08-21 19:55:12 +00:00
Xu Han	1e3fe78a10	[inductor] disable min/max macro on Windows. (#161133 ) Disable min/max macro on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161133 Approved by: https://github.com/angelayi	2025-08-21 19:52:56 +00:00
Tsung-Hsien Lee	a445b41e4f	[pytorch] Simplify PyTorch `foreach_*` API restrictions check (#161039 ) Summary: C++'s polymorphism and reusing components help us reduce the amount of boilerplate code here. Test Plan: CI & tests Rollback Plan: Differential Revision: D80594353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161039 Approved by: https://github.com/janeyx99	2025-08-21 19:50:02 +00:00
Tsung-Hsien Lee	801851086d	[pytorch] Invoke `vector.reserve()` consistently for non-inplace foreach operations (#161128 ) Summary: The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons: 1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically. 2. Memory efficiency: It ensures that the vector allocates exactly the amount of memory needed, no more and no less, which is efficient when we know the final size in advance. 3. Reduced overhead: Each reallocation typically involves: - Allocating a new, larger block of memory - Copying all existing elements to the new location - Destroying the old elements - Deallocating the old memory block 4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs. Test Plan: OSS CI & tests Rollback Plan: Differential Revision: D80674453 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161128 Approved by: https://github.com/Skylion007	2025-08-21 19:43:11 +00:00
dolpm	958f9ca88e	[nativert] oss static kernel tests (#161087 ) Summary: att - should be no-op Test Plan: buck2 test //caffe2/test/cpp/nativert:static_kernel_ops_tests Tests finished: Pass 24. Fail 0. Fatal 0. Skip 0. Build failure 0 Rollback Plan: Differential Revision: D80216488 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161087 Approved by: https://github.com/georgiaphillips, https://github.com/henryoier	2025-08-21 19:42:21 +00:00
James Wu	9668210302	Allow bypasses for Precompile when guards, etc. cannot be serialized (#160902 ) This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. This allows us to safely bypass if there are models with unserializable or incompatible parts. When we encounter something incompatible, we'll raise a bypass and ignore that particular code in DynamoCodeEntry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160902 Approved by: https://github.com/zhxchen17	2025-08-21 18:20:42 +00:00
Huy Do	3f5a8e2003	Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084 ) Fixes https://github.com/pytorch/pytorch/issues/160988. The root cause can be found in the same issue. This fix ensures that when reuse old wheel is on and `torchaudio` wheel is not there, the inductor test job can still rebuild the wheel it needs Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084 Approved by: https://github.com/malfet, https://github.com/zou3519	2025-08-21 17:38:32 +00:00
Angela Yi	3dacaf0e1e	[aoti-fx] Add meta["val"] metadata (#161019 ) Summary: Added a `_set_node_metadata_hook` which automatically adds node.meta["val"] to every new node that gets created under this context. Test Plan: ` buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_advanced_ops` https://www.internalfb.com/buck2/866439a2-2ba6-42d1-8e43-508d60456e2e `buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_basic_ops` https://www.internalfb.com/intern/testinfra/testrun/11540474149662857 Rollback Plan: Differential Revision: D80579336 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161019 Approved by: https://github.com/blaine-rister	2025-08-21 16:45:41 +00:00
PyTorch MergeBot	a6401cb5aa	Revert "flip the list-as-tuple behavior for short lists (#160794 )" This reverts commit febfc3ec03004116dfd6d504e6853ff02a1dd6e0. Reverted https://github.com/pytorch/pytorch/pull/160794 on behalf of https://github.com/seemethere due to This is failing internal tests, see D80671241 ([comment](https://github.com/pytorch/pytorch/pull/160794#issuecomment-3211314867))	2025-08-21 16:33:30 +00:00
PyTorch MergeBot	7006fd0c88	Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 )" This reverts commit 517d38d3406abbba35d0694bff259a698cad3ec9. Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/IvanKobzarev due to Segment tree starts failing on trunk even ciflows/trunk passed on PR ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3211286092))	2025-08-21 16:22:44 +00:00
IvanKobzarev	517d38d340	[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 ) 1. Applying @eellison idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory: ``` """ Alternative version of estimate_peak_memory, that respects the fact, that every SchedulerNode has multiple phases: 1. alloc ( outputs ) 2. run_kernel 3. dealloc last_use buffers estimate_peak_memory collapses memory into one value: size_alloc - size_free While peak memory happens after alloc. Duplicating the code to not migrate all callsites at once, In future usages of estimate_peak_memory will migrate to this version. """ ``` - Applying this in `reorder_communication_preserving_peak_memory` pass. 2. Buffers during reordering can change deallocation point, if candidate and group to swap both are users of the f_input_buf and group contains last_use_snode. - Addressing this tracking the last_use_snode for each buffer and recomputing current memory respecting the change in size_free (group_node after reordering is not the last user of the buffer and its size_free -= buffer_size, while candidate becomes the last user and candidate.size_free += buffer_size). 4. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit number of collectives to reorder. What is after this PR: Iterative recomputation of memory estimations matches full memory estimations. Active memory is not regressing a lot, but reserved memory is significantly regressed. Investigation and fix of "reserved" memory will be in following PRs. 
BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb ``` [rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58% [rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05% [rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55% [rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44% [rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35% ``` REORDER: active: 32Gb reserved: 36Gb ``` [rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50% [rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77% [rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31% [rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64% [rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 
active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55% ``` REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb ``` [rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01% [rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02% [rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13% [rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21% [rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90% ``` Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113 Approved by: https://github.com/wconstab, https://github.com/eellison Co-authored-by: eellison <elias.ellison@gmail.com>	2025-08-21 15:45:06 +00:00
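The difference between the collapsed estimate and the alloc/run/dealloc estimate described in #160113 can be shown on a toy schedule. This is a sketch under stated assumptions: each node is reduced to a hypothetical `(size_alloc, size_free)` pair, and neither function is the scheduler's `estimate_peak_memory`.

```python
def peak_collapsed(nodes):
    # Collapses each node into one delta: size_alloc - size_free.
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc - size_free
        peak = max(peak, current)
    return peak


def peak_alloc_then_free(nodes):
    # Tracks the phases separately: outputs are allocated, the kernel runs
    # (the peak is sampled here), then last-use buffers are freed.
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc
        peak = max(peak, current)
        current -= size_free
    return peak


nodes = [(10, 0), (8, 10), (4, 8)]  # hypothetical (size_alloc, size_free) pairs
```

On this schedule the collapsed version reports a peak of 10, while the phase-aware version reports 18, because the second node's output is live at the same time as the buffer it is about to free.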
Andy Lugo	3caddd4daa	[ROCm] SDPA fix mem fault when dropout is enabled (#154864 ) Fixes an issue that caused a device-side memory access fault due to incorrect tensor lifetime management. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-21 14:23:13 +00:00
Kaichao You	18271148d3	[dist] expose unsafe_get_ptr for dist.ProcessGroupNCCL.NCCLConfig (#161136 ) Expose the pointer so that the `ncclConfig_t` object can be created from PyTorch and used elsewhere. This is useful for controlling the NCCL communicator parameters when there are multiple NCCL communicators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136 Approved by: https://github.com/kwen2501	2025-08-21 10:47:03 +00:00
Xia, Weiwen	a941d7ffe5	[Quant][CPU] Avoid NaN in fp8 output of qlinear and qconv (#160957 ) Summary: When the output dtype is fp8, oneDNN does not ensure that intermediate results are within the range [-448, 448] before converting to fp8, so the output may contain NaN, which is a disaster for inference. This PR fixes the issue by clamping the intermediate results with oneDNN's post-op clip. Test plan: ``` pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957 Approved by: https://github.com/Valentine233, https://github.com/CaoE	2025-08-21 08:36:21 +00:00
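The clamping step described in the commit above can be illustrated with a minimal sketch. It is plain Python, not the oneDNN post-op itself, and the helper name `clip_to_fp8_range` is hypothetical; only the [-448, 448] bounds come from the commit message.

```python
# Hypothetical helper, not part of PyTorch or oneDNN: it only illustrates
# clamping intermediate results into [-448, 448] before an fp8 conversion.
FP8_MIN = -448.0
FP8_MAX = 448.0


def clip_to_fp8_range(values, lo=FP8_MIN, hi=FP8_MAX):
    """Return a new list with every value clamped into [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


intermediate = [500.0, -1000.0, 1.5]
clipped = clip_to_fp8_range(intermediate)
print(clipped)
```

Out-of-range values saturate at the bounds, so a later narrowing conversion never sees a magnitude above 448.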
PyTorch MergeBot	acb00d3ccf	Revert "Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084 )" This reverts commit cfdaaaaa26d7f34427ba941569eca46f02f79f3e. Reverted https://github.com/pytorch/pytorch/pull/161084 on behalf of https://github.com/huydhn due to My mistake in not checking for nvidia-smi availability ([comment](https://github.com/pytorch/pytorch/pull/161084#issuecomment-3209498435))	2025-08-21 08:17:04 +00:00
PyTorch MergeBot	bd5857a1d6	Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 )" This reverts commit 9d18bf01b1661d227f6af41ac07a1e9ef20a9e1a. Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of failures showing up after this lands ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3209487237))	2025-08-21 08:13:33 +00:00
CaoE	23b033452f	[Inductor][CPP] Fix layout for local buf in outer loop fusion (#160857 ) Fixes #159154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160857 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-08-21 06:00:04 +00:00
Dylan Maloy	2f50ae7d20	[nativert] make runtime const folding aware of run_const_graph (#160760 ) Summary: it's possible that we have foldable nodes that use things that will be folded by run_const_graph Test Plan: CI Rollback Plan: Differential Revision: D80355542 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160760 Approved by: https://github.com/SherlockNoMad	2025-08-21 05:22:03 +00:00
IvanKobzarev	9d18bf01b1	[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 ) 1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory: ``` """ Alternative version of estimate_peak_memory, that respects the fact, that every SchedulerNode has multiple phases: 1. alloc ( outputs ) 2. run_kernel 3. dealloc last_use buffers estimate_peak_memory collapses memory into one value: size_alloc - size_free While peak memory happens after alloc. Duplicating the code to not migrate all callsites at once, In future usages of estimate_peak_memory will migrate to this version. """ ``` - Applying this in the `reorder_communication_preserving_peak_memory` pass. 2. Buffers can change their deallocation point during reordering if the candidate and the group to swap are both users of the f_input_buf and the group contains last_use_snode. - Addressing this by tracking the last_use_snode for each buffer and recomputing current memory with respect to the change in size_free (after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while candidate becomes the last user, so candidate.size_free += buffer_size). 3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives to reorder. State after this PR: Iterative recomputation of memory estimations matches full memory estimations. Active memory does not regress much, but reserved memory regresses significantly. Investigation and a fix for "reserved" memory will follow in later PRs. 
BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb ``` [rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58% [rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05% [rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55% [rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44% [rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35% ``` REORDER: active: 32Gb reserved: 36Gb ``` [rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50% [rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77% [rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31% [rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64% [rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 
active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55% ``` REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb ``` [rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01% [rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02% [rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13% [rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21% [rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90% ``` Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113 Approved by: https://github.com/wconstab, https://github.com/eellison Co-authored-by: eellison <elias.ellison@gmail.com>	2025-08-21 05:19:38 +00:00
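The difference between the collapsed estimate and the alloc/run/dealloc view described in the `estimate_peak_memory` docstring above can be sketched as follows. This is an illustrative toy, not inductor's implementation; the function names and the node sizes are made up.

```python
from typing import List, Tuple

# Each node is (size_alloc, size_free): bytes allocated for its outputs and
# bytes released after it runs. The numbers below are made up.
Node = Tuple[int, int]


def peak_collapsed(nodes: List[Node]) -> int:
    """Peak when each node contributes one net value: size_alloc - size_free."""
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc - size_free
        peak = max(peak, current)
    return peak


def peak_alloc_free(nodes: List[Node]) -> int:
    """Peak when memory is sampled after alloc and before dealloc."""
    current = peak = 0
    for size_alloc, size_free in nodes:
        current += size_alloc      # phase 1: allocate outputs
        peak = max(peak, current)  # phase 2: kernel runs at this level
        current -= size_free       # phase 3: free last-use buffers
    return peak


nodes = [(10, 0), (5, 10), (3, 5)]
print(peak_collapsed(nodes), peak_alloc_free(nodes))
```

In this toy schedule the collapsed estimate reports 10 while the phase-aware one reports 15, because the second node's outputs exist before its inputs are freed.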
dolpm	67b98da1b2	[nativert] oss static kernel test utils (#161086 ) Summary: att - should be a no-op Test Plan: ci Rollback Plan: Differential Revision: D80214768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161086 Approved by: https://github.com/georgiaphillips	2025-08-21 04:49:06 +00:00
PyTorch UpdateBot	b0420d2438	[vllm hash update] update the pinned vllm hash (#161121 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161121 Approved by: https://github.com/pytorchbot	2025-08-21 04:21:09 +00:00
PyTorch UpdateBot	6096d277c5	[audio hash update] update the pinned audio hash (#161021 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161021 Approved by: https://github.com/pytorchbot	2025-08-21 04:20:56 +00:00
Huy Do	cfdaaaaa26	Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084 ) Fixes https://github.com/pytorch/pytorch/issues/160988. The root cause can be found in the same issue. This fix ensures that when reuse old wheel is on and `torchaudio` wheel is not there, the inductor test job can still rebuild the wheel it needs Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084 Approved by: https://github.com/malfet, https://github.com/zou3519	2025-08-21 03:47:15 +00:00
Eddie Yan	117f11adb4	[FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` (#161102 ) For https://github.com/pytorch/pytorch/issues/161022 The warning says the old API will be deprecated in 2.9+ anyway, leaving it up to the author of #125888 to decide on initialization behavior then Pull Request resolved: https://github.com/pytorch/pytorch/pull/161102 Approved by: https://github.com/ngimel, https://github.com/drisspg, https://github.com/BoyuanFeng	2025-08-21 03:36:52 +00:00
Rohit Manav	a154c2093c	remove redundant installation (#160634 ) Fixes #160302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160634 Approved by: https://github.com/sekyondaMeta, https://github.com/malfet	2025-08-21 03:31:12 +00:00
Xia, Weiwen	39862acb2e	[CPU][Inductor] improve performance of A16W4 GEMM template (#159127 ) Summary This PR improves performance of A16W4 GEMM template by removing boundary check of prefetch in the kernel code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159127 Approved by: https://github.com/CaoE	2025-08-21 03:16:26 +00:00
bobrenjc93	9a41570199	[rfc] add hint_override kwarg to mark_dynamic (#161007 ) The motivation for this change can be seen through the following example: ``` import torch GPU_TYPE = "cuda" @torch.compile def no_override(x): return x.sum(dim=0) @torch.compile def override(x): return x.sum(dim=0) x_small = torch.randn(4096, 512, device=GPU_TYPE) no_override(x_small) torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000) override(x_small) ``` Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size: ``` def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr): xnumel = 16384 rnumel = r0_numel ``` With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes: ``` def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr): xnumel = 1024000 rnumel = r0_numel ``` This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example: ``` f(s0) -> f(s2) f(s1) -> f(s2) ``` could generate different kernels. With the new approach, an explicit override pins the chosen configuration: ``` f(s0, hint_override=s0) -> f(s2) f(s1, hint_override=s0) -> f(s2) ``` ensuring consistent kernel generation regardless of input order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007 Approved by: https://github.com/jansel	2025-08-21 02:22:52 +00:00
PyTorch MergeBot	f9875166a9	Revert "[FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136 )" This reverts commit 3d126e17e0c2630031e7a359d6a6fd1dbe52c4f7. Reverted https://github.com/pytorch/pytorch/pull/160136 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))	2025-08-21 01:34:19 +00:00
PyTorch MergeBot	6b5be1f4a0	Revert "[FSDP][Replicate] replicate tests for param registration and input device movements (#160147 )" This reverts commit a3a82e3da85a53afc4bbf3d75bd3d3dcc2e06645. Reverted https://github.com/pytorch/pytorch/pull/160147 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))	2025-08-21 01:34:19 +00:00
Huamin Li	0924304e72	[AOTI] Add a new config cpp.use_constexpr_for_int_array (#160927 ) Summary: Default True so same as before, but make it configurable Differential Revision: D80185094 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160927 Approved by: https://github.com/henryoier	2025-08-21 01:16:27 +00:00
Natalia Gimelshein	d875d3ca1e	don't try to set lazy module loading env var (#161103 ) This is not needed on drivers >=525, and in DriverAPI::get() we are initializing the context anyway, so setting the environment variable after that is beside the point. As a result of calling DriverAPI::get on systems that don't have GPUs available (e.g. due to CUDA_VISIBLE_DEVICES="") people were getting confusing errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161103 Approved by: https://github.com/eqy, https://github.com/malfet	2025-08-21 01:06:51 +00:00
Yuxuan Chen	a825557ed5	Workaround ATen SFINAE under libc++ (#161101 ) The existing logic here to workaround dealing with SFINAE under Microsoft platforms also applies to libc++ platforms. It appears that nvcc reports ambiguity in overload resolution for `pow_`. This seems like a nvcc limitation. ``` fbcode/caffe2/aten/src/ATen/native/cuda/Pow.cuh(42): error: more than one instance of overloaded function "pow" matches the argument list: function template "std::__2::enable_if<<expression>, std::__2::__promote<_A1, _A2, void>>::type::type pow(_A1, _A2) noexcept" (declared at line 848 of fbcode/third-party-buck/platform010-libcxx/build/libcxx/include/c++/v1/math.h) function template "std::__2::enable_if<<expression>, std::__2::__promote<_Tp, _Up, void>>::type pow(_Tp, _Up) noexcept" (declared at line 11308 of fbcode/third-party-buck/platform010/build/cuda/12.4/bin/..//include/crt/math_functions.h) argument types are: (double, float) return ::pow(base, exp); ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161101 Approved by: https://github.com/malfet	2025-08-21 00:55:58 +00:00
Nikita Shulga	3e3e83418d	[BE] Move indexing tests to test_indexing (#160994 ) Which enables them on MPS device - xfail all `test_index_reduce` on MPS, as op is not implemented - xfail all `test_index_copy` on MPS due to the silent correctness problems, see https://github.com/pytorch/pytorch/issues/160993 - Fixed hard crash in `index_fill` and replaced `skipIfMPS` with `expectedFailureMPS` - Created issue for the lack of deterministic algorithms for MPS backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/160994 Approved by: https://github.com/manuelcandales ghstack dependencies: #160850, #160889, #160926	2025-08-21 00:42:55 +00:00
Jazlyn Li	667245dc60	TritonKernel.inductor_meta_common() -> self.inductor_meta_common() (#160895 ) Summary: use `self.inductor_meta_common()` to call the static method, since the custom subclasses may overwrite the method to be an instance method Test Plan: ``` caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks ``` Rollback Plan: Differential Revision: D80375351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895 Approved by: https://github.com/eellison, https://github.com/blaine-rister	2025-08-21 00:22:51 +00:00
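The reason for calling through `self` rather than through the class name, as the commit above describes, can be shown with a small sketch. The class and method names here are hypothetical stand-ins, not the real `TritonKernel` API.

```python
class BaseKernel:
    @staticmethod
    def meta_common():
        return {"source": "base"}

    def build_via_class(self):
        # Hard-coded class lookup: always reaches BaseKernel's static method.
        return BaseKernel.meta_common()

    def build_via_self(self):
        # Lookup through the instance: honors overrides in subclasses.
        return self.meta_common()


class CustomKernel(BaseKernel):
    def __init__(self, tag):
        self.tag = tag

    # The subclass replaces the static method with an instance method.
    def meta_common(self):
        return {"source": "custom", "tag": self.tag}


kernel = CustomKernel("demo")
print(kernel.build_via_class())
print(kernel.build_via_self())
```

Calling a static method through `self` still works for the base class, so switching to `self.` is safe for existing code while letting subclasses substitute an instance method.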
Grant	54c2b66592	Replace _device_t with torch.types.Device in torch/cpu/__init__.py (#161031 ) Fixes #152952 Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did basic smoke test by running tests that `import torch.cpu` including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`. Based this PR off of #152935 which is referenced in the main issue. (also, this is my first contribution but I followed the contributing guide closely) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031 Approved by: https://github.com/janeyx99	2025-08-21 00:22:43 +00:00
Xu Han	be87f22dfb	[inductor] Enable updated __cplusplus macro (#161064 ) Some Intel oneAPI headers depend on the `__cplusplus` macro. This PR enables the updated `__cplusplus` macro for MSVC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161064 Approved by: https://github.com/angelayi	2025-08-21 00:17:08 +00:00
Xu Han	2a7a7ad711	[inductor] add level zero for xpu (#161061 ) Add level zero for Inductor xpu on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161061 Approved by: https://github.com/angelayi	2025-08-21 00:14:15 +00:00
Teja Rao	7e6ce41555	[dcp_poc] add async checkpointing tests (#161034 ) Summary: add tests for async checkpointer for the experimental checkpointer Test Plan: tests Rollback Plan: Differential Revision: D80590461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161034 Approved by: https://github.com/pradeepfn	2025-08-21 00:08:53 +00:00
Ben Niu	4ed3184dee	Conditionally enable ACL for bmm_out_or_baddbmm_ (#161065 ) Summary: Similar to #ifdef checks added in addmm_impl_cpu_ to conditionally enable ACL, we add the same checks in bmm_out_or_baddbmm_. This essentially disables ACL for bmm_out_or_baddbmm_ and enables ArmPL, which seems to be performing better. Test Plan: AR SL Rollback Plan: Reviewed By: Nicoshev Differential Revision: D80494623 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161065 Approved by: https://github.com/q10	2025-08-20 23:32:25 +00:00
Pian Pawakapan	44549c7146	[dynamic shapes] unbacked-safe slicing (#157944 ) Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944 Approved by: https://github.com/laithsakka	2025-08-20 22:52:56 +00:00
Natalia Gimelshein	febfc3ec03	flip the list-as-tuple behavior for short lists (#160794 ) Per title, previously we started throwing noisy warnings, but given how popular this pattern was in our test suite decided to leave it as warning, not as silent behavior change for one release. Now `treatSequenceAsTuple` would return `true` in the only case where the sequence was indeed a tuple, so no need for a special function anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160794 Approved by: https://github.com/albanD	2025-08-20 22:40:42 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	5afa4187df	Close some sources of fake tensor leakages (#159923 ) Differential Revision: D79694055 A couple of fixes: 1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and error using the FQN of the lifted constant. 2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict. 3. We modify yolov3 to fix the previous silently incorrect behaviour. When upgrading the torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159923 Approved by: https://github.com/avikchaudhuri	2025-08-20 22:24:23 +00:00
Mikayla Gawarecki	30384abcb1	Decrease number of bytes used by uninitialized tokens_ in KernelFunction (#160764 ) Use std::unique_ptr to decrease the size from 24 bytes to 8. Since std::unique_ptr is not copyable, this required defining the copy constructor and copy assignment operator, which made me realize we shouldn't be copying `tokens_` in those. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160764 Approved by: https://github.com/albanD	2025-08-20 21:33:27 +00:00
Ethan Wee	16e811e0b5	[CI] remove tb-nightly (#160996 ) Removing tb-nightly because we found issues when importing tensorboard as having both tb-nightly and tensorboard causes issues when pip would report 2.18.0 (pinned tensorboard) but importing in a python shell would report 2.13.XXX. This mismatch causes issues when running tests in a numpy2.X environment. e.g. ``` /var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler /opt/venv/lib/python3.12/site-packages/redis/connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing warnings.warn(msg) /opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC). _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0) E ====================================================================== ERROR: test_event_handler (__main__.TestMonitorTensorboard.test_event_handler) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/jenkins/pytorch/test/test_monitor.py", line 116, in setUp from tensorboard.backend.event_processing import ( File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 25, in <module> from tensorboard.backend.event_processing import ( File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 25, in <module> from tensorboard.backend.event_processing import event_file_loader File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/event_file_loader.py", line 21, in <module> from tensorboard import dataclass_compat File 
"/opt/venv/lib/python3.12/site-packages/tensorboard/dataclass_compat.py", line 33, in <module> from tensorboard.plugins.hparams import metadata as hparams_metadata File "/opt/venv/lib/python3.12/site-packages/tensorboard/plugins/hparams/metadata.py", line 32, in <module> NULL_TENSOR = tensor_util.make_tensor_proto( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/venv/lib/python3.12/site-packages/tensorboard/util/tensor_util.py", line 405, in make_tensor_proto numpy_dtype = dtypes.as_dtype(nparray.dtype) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/venv/lib/python3.12/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py", line 677, in as_dtype if type_value.type == np.string_ or type_value.type == np.unicode_: ^^^^^^^^^^ File "/opt/venv/lib/python3.12/site-packages/numpy/__init__.py", line 400, in __getattr__ raise AttributeError( AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead. ---------------------------------------------------------------------- Ran 1 test in 0.355s FAILED (errors=1) ``` After removing tb-nightly and ensuring that tensorboard 2.18.0 is the only tensoboard in the env: ``` root@rocm-framework-47:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler . ---------------------------------------------------------------------- Ran 1 test in 0.409s OK ``` ``` >>> import tensorboard >>> print(tensorboard.__version__) 2.13.0a20230426 ``` ```:/# pip show tensorboard Name: tensorboard Version: 2.18.0 Summary: TensorBoard lets you watch Tensors Flow Home-page: https://github.com/tensorflow/tensorboard Author: Google Inc. 
Author-email: packages@tensorflow.org License: Apache 2.0 Location: /opt/venv/lib/python3.12/site-packages Requires: absl-py, grpcio, markdown, numpy, packaging, protobuf, setuptools, six, tensorboard-data-server, werkzeug Required-by: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160996 Approved by: https://github.com/huydhn	2025-08-20 21:25:58 +00:00
Tsung-Hsien Lee	19c70c2f3d	[pytorch] Faster and safer lambda expression capture in `has_integral_tensor()` (#161042 ) Summary: `includeBool` is already a small value type (i.e., `bool`, 1 byte) that is passed by value to the function. Capturing it by reference (4 or 8 bytes depending on the system) is unnecessary and could lead to a dangling reference if the lambda outlives the original variable. Capturing by value is more efficient for small types and safer. Test Plan: OSS CI & tests Rollback Plan: Differential Revision: D80595698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161042 Approved by: https://github.com/Skylion007	2025-08-20 20:59:41 +00:00
Will Feng	8047cde0f3	Try to fix Inductor CI periodic tests (#160932 ) - hf_Reformer: this one starts failing due to increased graph breaks due to transformers pin bump (#159291). We can likely just bump the expected graph break count. - dla102: this one starts timing out on 8/13 Wed between commit 6e8865f and ee1b041. But based on the PT2 dashboard, this model actually doesn't have compile time or runtime regression. Will try to bump up the timeout and see if it can work. - hf_BigBird: this one has its accuracy status improved since today. Will update hf_BigBird accuracy status. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932 Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet	2025-08-20 20:36:46 +00:00
Dmitry Nikolaev	24e7f3c21c	[ROCm] fix large tensor sort on MI350 (#161054 ) Currently `std::min` -> `::min` does not work as expected on ROCm when input values are >= 2147483648. Replace `std::min` with a ternary expression. Alternatively, `std::min` can be given an explicit type: `std::min<int64_t>`. Fixes on ROCm: test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16 Error: RuntimeError: Cannot sort dimension of length 8192 A similar PR fixing large tensors on ROCm: https://github.com/pytorch/pytorch/pull/130994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161054 Approved by: https://github.com/jeffdaily	2025-08-20 19:58:01 +00:00
Nikita Shulga	e1a64b75ff	[CD] Delete full builds (#161075 ) As they are no longer needed for Colab, see https://github.com/googlecolab/colabtools/issues/5508#issuecomment-3200871941 and [<img width="896" height="128" alt="image" src="https://github.com/user-attachments/assets/a287393c-bde7-4e10-99bf-2e0d66346efe" /> ](https://colab.research.google.com/drive/1YJ5Y0xsApXSewM1cQwWQ_AS3A77vytgq) Fixes https://github.com/pytorch/pytorch/issues/160972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161075 Approved by: https://github.com/atalman	2025-08-20 19:40:15 +00:00
eellison	b708966201	Fix bucketing introducing cycles (#160967 ) We were just looking at direct arguments, but not transitive dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160967 Approved by: https://github.com/IvanKobzarev	2025-08-20 19:38:46 +00:00
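The gap between checking direct arguments and checking transitive dependencies, mentioned in the commit above, can be sketched on a toy graph. This is a generic reachability check under assumed names, not the bucketing pass's actual code.

```python
from typing import Dict, List

# Toy graph: each node maps to the nodes it takes as direct arguments.
# Node names are made up for illustration.
graph: Dict[str, List[str]] = {
    "a": [],
    "x": ["a"],
    "y": ["x"],
    "b": ["y"],
}


def is_direct_arg(graph, node, target):
    """True only if `target` is one of `node`'s immediate arguments."""
    return target in graph[node]


def depends_on(graph, node, target):
    """True if `target` is reachable from `node` through any chain of arguments."""
    stack = list(graph[node])
    seen = set()
    while stack:
        cur = stack.pop()
        if cur == target:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph[cur])
    return False


print(is_direct_arg(graph, "b", "a"), depends_on(graph, "b", "a"))
```

Here `b` does not take `a` directly, yet it depends on `a` through `x` and `y`. Merging `a` and `b` into one bucket would make that bucket both a producer for `x` and a consumer of `y`, which is a cycle that only the transitive check detects.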
Tugsbayasgalan (Tugsuu) Manlaibaatar	dbef606631	Add support for tracing vmap in pre-dispatch export (#154650 ) Summary: ONNX team and recent transformer upgrade ran into this error and we also ran into during our export benchmarking. This diff makes it possible to trace through vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR and in the future, we should desugar them to post-grad ops. The implementation strategy is: 1. We add python wrappers around vmap APIs so that we attach custom torch function handler that is only on during non-strict export. The reason is we don't want to add this to default torch_function handler because it will break BC. 2. Some dynamo changes to make sure it picks up new python wrapper APIs. The reason is when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We can avoid this by special casing in dynamo for export to proxy different API calls but i feel that is too much chaos because you need to be able to proxy 2 different variants of same vmap API. Test Plan: CI Differential Revision: D75623875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-08-20 19:31:07 +00:00
Ruben Rodriguez Buchillon	c5cb255625	[inductor][mm] fix tma issue (#161025 ) # why - head is broken # what - the template for experimental API is broken - the test assumes not experimental API # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_persistent_tma_strided_a_transposed_True_b_transposed_False_dynamic_True -v ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161025 Approved by: https://github.com/PaulZhang12	2025-08-20 18:52:38 +00:00
redwrasse	957b170d8e	Fix SVD forward-mode AD multiplication priority (#161027 ) Multiplication order priority for the SVD JVP appears to have been the opposite of the optimal one. Results from a crude CPU benchmark on my laptop for random matrices of various ratios:
```
Performance Results Table
| Test Case | Matrix Size | Aspect Ratio | Before JVP (ms) | After JVP (ms) | Change (ms) | % Change | Status |
|----------------------------------|-------------|--------------|-----------------|----------------|-------------|----------|---------------------|
| Tall matrix (10:1 ratio) | 1000×100 | 10:1 tall | 3.13 | 3.24 | +0.11 | -3.5% | ❌ Regression |
| Tall matrix (10:1 ratio, larger) | 2000×200 | 10:1 tall | 15.72 | 14.66 | -1.06 | +6.7% | ✅ Improvement |
| Tall matrix (10:1 ratio, large) | 5000×500 | 10:1 tall | 105.97 | 101.84 | -4.13 | +3.9% | ✅ Improvement |
| Wide matrix (1:10 ratio) | 100×1000 | 1:10 wide | 5.90 | 4.64 | -1.26 | +21.4% | ✅ Major Improvement |
| Wide matrix (1:10 ratio, larger) | 200×2000 | 1:10 wide | 18.29 | 17.78 | -0.51 | +2.8% | ✅ Improvement |
| Wide matrix (1:10 ratio, large) | 500×5000 | 1:10 wide | 137.40 | 128.70 | -8.70 | +6.3% | ✅ Improvement |
| Square matrix (baseline) | 1000×1000 | 1:1 square | 116.16 | 106.09 | -10.07 | +8.7% | ✅ Improvement |
| Square matrix (larger baseline) | 2000×2000 | 1:1 square | 714.30 | 673.23 | -41.07 | +5.7% | ✅ Improvement |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161027 Approved by: https://github.com/soulitzer	2025-08-20 18:47:11 +00:00
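Why multiplication order matters for a chain of matrix products, as in the commit above, can be sketched with a simple operation count. The shapes and function names are illustrative assumptions and are not taken from the SVD JVP code.

```python
def matmul_flops(m, k, n):
    """Multiply-add count for an (m x k) @ (k x n) product."""
    return m * k * n


def cost_left_first(m, k, n, p):
    # (A @ B) @ C with A: m x k, B: k x n, C: n x p
    return matmul_flops(m, k, n) + matmul_flops(m, n, p)


def cost_right_first(m, k, n, p):
    # A @ (B @ C) with the same shapes
    return matmul_flops(k, n, p) + matmul_flops(m, k, p)


# Assumed shapes: a tall A (1000 x 100) followed by two 100 x 100 factors.
left = cost_left_first(1000, 100, 100, 100)
right = cost_right_first(1000, 100, 100, 100)
print(left, right)
```

Both orders give the same mathematical result, but for these shapes multiplying the two small factors first needs roughly half the multiply-adds, which is the kind of saving a better parenthesization can provide.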
Jovian Anthony Jaison	c02e26bf31	Fix filename showing up as ints in dynamo_compile stack_trace column. (#160916 ) Test plan: $ python -m test_utils Note: Another way is adding the actual file_name to from_traceback, but since it's referenced in multiple places and may have associated tests this seems safer. Lmk if changes are needed @c00w Pull Request resolved: https://github.com/pytorch/pytorch/pull/160916 Approved by: https://github.com/c00w, https://github.com/masnesral	2025-08-20 18:38:38 +00:00
eqy	c74e5f6061	[CUDA] Bump tolerances for `test_baddmm` (#159915 ) Only one mismatch out of the entire result tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159915 Approved by: https://github.com/nWEIdia, https://github.com/drisspg	2025-08-20 18:05:51 +00:00
dolpm	1471b20cb3	add static dispatch kernel registration to open source (#160439 ) Summary: static dispatch registry should be moved to open source. the rest can maintain internally for now, since delegates will all go through ET hop. Test Plan: spot checked existing tests and didn't see any missing registrations Differential Revision: D80099377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160439 Approved by: https://github.com/SherlockNoMad, https://github.com/zhxchen17	2025-08-20 17:58:00 +00:00
Kevin Yin	b2632e7982	Fix error message for fsdp_pre_all_gather (#160817 ) See: `20e40492b0/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L97-L104)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160817 Approved by: https://github.com/weifengpy, https://github.com/H-Huang	2025-08-20 17:43:57 +00:00
zhxchen17	5255e65c01	[dynamo] Refactor convert_frame to remove usage of nonlocal tracer output return. [4/n] (#160899 ) Today convert_frame is implemented like the following: ``` def _compile(): tracer_output = None def transform(): nonlocal tracer_output ... def _compile_inner(): transform(...) compile_inner(...) ``` The code uses an unconventional nonlocal variable as the return value. This is not ideal for two reasons: 1. Reasoning about the code, especially together with error-handling code, becomes harder. 2. More importantly, it makes it harder to extract common code into a shared library, because everything must depend on a central global state. In this diff we remove the nonlocal return and use a conventional function return to output the compilation data. Differential Revision: [D80461258](https://our.internmc.facebook.com/intern/diff/D80461258/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160899 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: #160814, #160815, #160855	2025-08-20 17:37:26 +00:00
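The nonlocal-versus-return refactor in the commit above can be illustrated with a minimal sketch. All names here (`TracerOutput`, `compile_with_nonlocal`, `transform`, `compile_with_return`) are hypothetical stand-ins and are not Dynamo's actual functions or types; the sketch only contrasts the two styles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TracerOutput:
    # Hypothetical stand-in for the data produced while tracing a frame.
    graph: str


def compile_with_nonlocal(code: str) -> Optional[TracerOutput]:
    """Before: the inner function passes its result out through `nonlocal`."""
    tracer_output: Optional[TracerOutput] = None

    def inner_transform() -> None:
        nonlocal tracer_output
        tracer_output = TracerOutput(graph=f"graph({code})")

    inner_transform()
    return tracer_output


def transform(code: str) -> TracerOutput:
    """After: a free function that returns its result directly."""
    return TracerOutput(graph=f"graph({code})")


def compile_with_return(code: str) -> TracerOutput:
    # The caller receives the value explicitly; no enclosing state is needed.
    return transform(code)
```

Because `transform` no longer depends on a variable in an enclosing scope, it can be moved into a shared module and called from other frontends without changes.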
zhxchen17	9e050b6339	[dynamo] Refactor convert_frame._compile_inner to return compiled bytecode + output graph. [3/n] (#160855 ) We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export). This PR adds a new helper function compile_frame() which takes a bytecode and a transform function and return compiled bytecode + output graph as DynamoOutput type. Differential Revision: [D80430802](https://our.internmc.facebook.com/intern/diff/D80430802/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160855 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: #160814, #160815	2025-08-20 17:37:26 +00:00
eellison	b3e215b864	Trigger h100 on test_max_autotune, mm, grouped_mm changes (#160678 ) Following @henrylhtsang 's pr here: https://github.com/pytorch/pytorch/pull/160656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160678 Approved by: https://github.com/henrylhtsang, https://github.com/ngimel	2025-08-20 16:56:30 +00:00
Wang, Chuanqi	e483947047	[BE] Remove intel-openmp dependency in setup.py (#160976 ) Fixes #160962 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160976 Approved by: https://github.com/xuhancn, https://github.com/atalman	2025-08-20 16:33:16 +00:00
Angel Li	8e17709055	FlexDecode not guarding on GQA groups correctly (#160904 ) Addressing #151359 Updates flex_decode dispatch to use flex attention rather than flex decode if number of groups is not a power of 2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160904 Approved by: https://github.com/drisspg	2025-08-20 16:32:16 +00:00
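The dispatch rule in the commit above can be sketched as follows. This is an illustration under assumptions, not the actual Inductor code: the function names and the string return values are hypothetical, and the real dispatch may check further conditions that are not shown here.

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0


def choose_kernel(num_query_heads: int, num_kv_heads: int) -> str:
    # GQA group count: how many query heads share one key/value head.
    # Assumes num_query_heads is a multiple of num_kv_heads.
    groups = num_query_heads // num_kv_heads
    # Fall back to the general flex attention path for non-power-of-two groups.
    return "flex_decode" if is_power_of_two(groups) else "flex_attention"
```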
Isuru Fernando	e631557518	Fix meta function for aten.complex (#160894 ) Closes https://github.com/pytorch/pytorch/issues/160882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160894 Approved by: https://github.com/mlazos	2025-08-20 16:30:04 +00:00
Charlie West-Taylor	7f201baf41	Allow exposing more functions during initial template expansion (#159554 ) Also adds a `_register_hook` utility, and documents & type annotates `PartialRender`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159554 Approved by: https://github.com/laithsakka, https://github.com/kundaMwiza	2025-08-20 16:08:55 +00:00
Aidyn-A	ce048de608	[ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357 ) This pull request adds the following ops for sparse matrices using Eigen library: ```python add(a_csr, b_csr) add(a_csc, b_csc) addmm(c_csr, a_csr, b_csr) addmm(c_csr, a_csr, b_csc) addmm(c_csr, a_csc, b_csc) addmm(c_csr, a_csc, b_csr) addmm(c_csc, a_csr, b_csr) addmm(c_csc, a_csr, b_csc) addmm(c_csc, a_csc, b_csc) addmm(c_csc, a_csc, b_csr) ``` Currently, the operations for sparse matrices on CPU are available through MKL only. The non-existence of MKL on `aarch64` causes the unavailability of these ops on any machines with ARM based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops. This is a re-factored version of my previous PR #101814. The main difference with the old one, this does not enable Eigen by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357 Approved by: https://github.com/pearu, https://github.com/eqy	2025-08-20 15:44:54 +00:00
PyTorch MergeBot	90ea9ccefe	Revert "[rfc] add hint_override kwarg to mark_dynamic (#161007 )" This reverts commit 0533ff2ccba7e77622ac3c6758f1032bdc10feff. Reverted https://github.com/pytorch/pytorch/pull/161007 on behalf of https://github.com/jeffdaily due to failing on both cuda and rocm ([comment](https://github.com/pytorch/pytorch/pull/161007#issuecomment-3206893756))	2025-08-20 15:31:33 +00:00
PyTorch MergeBot	6ea4be1e2e	Revert "[dynamic shapes] unbacked-safe slicing (#157944 )" This reverts commit 2f0cba934de7094a66c6ce68f5e937254f23142a. Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/seemethere due to This is blocking internal sync due to merge conflicts ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3206833193))	2025-08-20 15:16:45 +00:00
Joshua Su	a818fa77e3	Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#160999 ) Summary: reverting this diff since it caused S551328. Please see D80217492 for details. Test Plan: NA Rollback Plan: Differential Revision: D80553314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999 Approved by: https://github.com/izaitsevfb, https://github.com/jingsh	2025-08-20 15:04:36 +00:00
Mwiza Kunda	5ee464db5c	[inductor] Fix descriptor broadcasting for singleton dimensions (#160310 ) This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail. As an example, consider the following store instruction: ``` name=buf1 index=x2 + 192y0 + 64y1 value=TritonCSEVariable('tmp7') params = BlockParameters( shape=[3, 4, 1, 1, 64], block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK], strides=[64, 192, 0, 0, 1], offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset] ) broadcasting_dims=[False, False, True, True, False] broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK] ``` Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim3 is incorrectly marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`. ``` pre_broadcast_shape = [ sympy.S.One if is_broadcasting else dim for dim, is_broadcasting in zip( self.broadcast_shape, self.broadcasting_dims ) ] ``` The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]` Triton throws the following error: `reshape() cannot change total number of elements in tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310 Approved by: https://github.com/blaine-rister	2025-08-20 09:48:58 +00:00
bobrenjc93	0533ff2ccb	[rfc] add hint_override kwarg to mark_dynamic (#161007 ) The motivation for this change can be seen through the following example: ``` import torch GPU_TYPE = "cuda" @torch.compile def no_override(x): return x.sum(dim=0) @torch.compile def override(x): return x.sum(dim=0) x_small = torch.randn(4096, 512, device=GPU_TYPE) no_override(x_small) torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000) override(x_small) ``` Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size: ``` def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr): xnumel = 512 r0_numel = 32 ``` With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes: ``` def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr): xnumel = 16384 r0_numel = 128 ``` This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example: ``` f(s0) -> f(s2) f(s1) -> f(s2) ``` could generate different kernels. With the new approach, an explicit override pins the chosen configuration: ``` f(s0, hint_override=s0) -> f(s2) f(s1, hint_override=s0) -> f(s2) ``` ensuring consistent kernel generation regardless of input order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007 Approved by: https://github.com/jansel	2025-08-20 07:51:09 +00:00
Nick Riasanovsky	a9fabeb012	[BE] Fix old TMA API in persistent matmul template (#161030 ) Summary: Fixes a bug introduced by https://github.com/pytorch/pytorch/pull/159407 Test Plan: NA Rollback Plan: Differential Revision: D80588320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161030 Approved by: https://github.com/adamomainz, https://github.com/NikhilAPatel, https://github.com/nmacchioni, https://github.com/aakhundov	2025-08-20 05:53:57 +00:00
FFFrog	0f801a510f	Using std::vector or c10::SmallVector instead of CArray (#160959 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160959 Approved by: https://github.com/Skylion007	2025-08-20 05:32:29 +00:00
dolpm	576a0e64ed	[nativert] ensure that moveable outputs are set in other executionframe ctor (#161005 ) Summary: so we use this constructor in HigherOrderKernel. problems arise in the loop condition, where it's possible for an output from the prev. iteration to be an input to the next. so the Output(N) of a kernel may be the Input(M) to a kernel in the next iteration. Thus, if the output value is reset (via. fastresizetozero) or overwritten by a prev. kernel before it is to be used, we have major major issues. we need to enforce that outputs are moved, not copied, to ensure this doesn't happen. Test Plan: buck2 test //caffe2/test:test_export --local-only -- test_while_loop_tensor_constant_idx_cpp_runtime_nonstrict Rollback Plan: Differential Revision: D80565374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161005 Approved by: https://github.com/SherlockNoMad	2025-08-20 05:05:32 +00:00
Menglu Yu	a3fe1ced40	[Optimus][decompose_mm] Fix BooleanAtom corner case (#160987 ) Summary: We observe a case where the BooleanAtom does not support regular sum op for bool exp, thus we fix it by using bool() Rollback Plan: Differential Revision: D80550876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160987 Approved by: https://github.com/Yuzhen11, https://github.com/mlazos	2025-08-20 04:36:12 +00:00
PyTorch UpdateBot	7e4bfa74ea	[vllm hash update] update the pinned vllm hash (#161020 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161020 Approved by: https://github.com/pytorchbot	2025-08-20 04:15:50 +00:00
Teja Rao	d8fcb2a4ac	[dcp_poc] Fix parameter order in distributed checkpoint API to use path-first for consistency (#160986 ) Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns. Test Plan: sandcastle tests Rollback Plan: Differential Revision: D80549014 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986 Approved by: https://github.com/pradeepfn	2025-08-20 04:09:18 +00:00
Sandeep Narendranath Karjala	2b62ef7420	Add kernel information JSON generation for AOTI packages (#160540 ) Summary: Build on D80031559. Generate kernel_information.json in AOTI compiled artifacts by combining stack traces and node mappings from provenance tracking. This implementation delivers exactly what Zoomer team requested: 1. Core Function: `create_kernel_information_json()` in debug.py combines 3 data sources: - `_inductor_kernel_stack_trace` → `stack_traces` field - `_inductor_triton_kernel_to_post_grad_node_info` → `post_grad_nodes` field - `_inductor_post_to_pre_grad_nodes["postToPre"]` → `pre_grad_nodes` field 2. AOTI Integration: codecache.py writes `kernel_information.json` to pt2 packages when both AOTI packaging and provenance tracking are enabled. 3. Test Coverage: TestKernelInformationAOTI class validates: - JSON file creation in AOTI packages using zipfile - Exact format compliance - Proper disabling without provenance tracking Output Format (exact specification): ```json { "triton_kernel_name_1": { "stack_traces": [str, str, ...], "post_grad_nodes": [str, str, ...], "pre_grad_nodes": [str, str, ...] } } ``` Test Plan: ``` buck test fbcode//caffe2/test/inductor:provenance_tracing -- TestKernelInformationAOTI ``` Manual validation: ```python import torch model = torch.nn.Linear(10, 1) with torch._inductor.config.patch("aot_inductor.package", True): with torch._inductor.config.patch("trace.basic_provenance_tracking", True): # AOTI compilation should generate kernel_information.json compiled = torch.export.export(model, (torch.randn(1, 10),)) ``` --- Rollback Plan: Differential Revision: D80139160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160540 Approved by: https://github.com/yushangdi	2025-08-20 02:33:45 +00:00
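The way the three provenance sources in the commit above feed the output format can be sketched as follows. This is not the actual `create_kernel_information_json()` implementation: the function name `build_kernel_information` and the assumed input shapes (plain dictionaries keyed by kernel name or post-grad node name) are hypothetical and chosen only to show the combination step.

```python
import json
from typing import Dict, List


def build_kernel_information(
    stack_traces: Dict[str, List[str]],
    kernel_to_post_grad: Dict[str, List[str]],
    post_to_pre: Dict[str, List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Combine three mappings into one record per kernel (assumed shapes)."""
    info: Dict[str, Dict[str, List[str]]] = {}
    for kernel, post_nodes in kernel_to_post_grad.items():
        # Collect pre-grad nodes reachable from this kernel's post-grad nodes,
        # keeping first-seen order and dropping duplicates.
        pre_nodes: List[str] = []
        for node in post_nodes:
            for pre in post_to_pre.get(node, []):
                if pre not in pre_nodes:
                    pre_nodes.append(pre)
        info[kernel] = {
            "stack_traces": list(stack_traces.get(kernel, [])),
            "post_grad_nodes": list(post_nodes),
            "pre_grad_nodes": pre_nodes,
        }
    return info


def to_json(info: Dict[str, Dict[str, List[str]]]) -> str:
    # Serialize in the layout given by the commit's output format.
    return json.dumps(info, indent=2)
```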
Lucas Kabela	54cc63b467	[BE][Dynamo] Type coverage for symbolic_convert (#160922 ) As part of better engineering, we add type coverage to `dynamo/symbolic_convert.py`, which is the main work engine of dynamo for emulating python bytecode. Running ``` mypy torch/_dynamo/symbolic_convert.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 764 \| 4286 \| 17.83% \| 43 \| 241 \| 17.84% \| \| This PR \| 4322 \| 4322 \| 100.00% \| 241 \| 241 \| 100.00% \| \| Delta \| +3558 \| +36 \| +82.17% \| +198 \| 0 \| +82.16% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/160922 Approved by: https://github.com/StrongerXi	2025-08-20 01:24:31 +00:00
zhxchen17	599f639ddb	[dynamo] Refactor transform() so that instruction translator can be used as a tracing function. [2/n] (#160815 ) We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export). This PR follows the last one, which separates out the part that runs the instruction translator on a given frame and returns a DynamoTracerOutput. The end result is a free function that runs the instruction translator independently. A follow-up diff will wrap the low-level function. Differential Revision: [D80388694](https://our.internmc.facebook.com/intern/diff/D80388694/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160815 Approved by: https://github.com/anijain2305 ghstack dependencies: #160814	2025-08-20 01:16:35 +00:00
Simon Fan	72e4786d16	[dynamo][dist] trace DeviceMesh's get_local_rank and get_rank as constants (#160805 ) Used in https://github.com/pytorch/torchtitan/pull/1555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160805 Approved by: https://github.com/StrongerXi, https://github.com/mlazos	2025-08-20 01:12:24 +00:00
CaoE	371909cfd1	[Inductor][CPP] Add float16 support for CppMicroGemmAMX (#147368 ) Add float16 support for CppMicroGemmAMX for float16 gemm template. Float16 CppMicroGemmAMX needs a higher version of compiler, e.g., GCC 13. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147368 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2025-08-20 01:04:05 +00:00
Mikayla Gawarecki	78a8e6a671	Add new_empty (with dtype argument only) to torch::stable (#159508 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159508 Approved by: https://github.com/janeyx99 ghstack dependencies: #160557	2025-08-20 00:50:42 +00:00
Jagadish Krishnamoorthy	543896fcf3	test_matmul_cuda: Refine MX test skipping (#161009 ) Replace return unittest.skip with raise unittest.SkipTest to ensure that the test suite correctly reports skipped tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161009 Approved by: https://github.com/jeffdaily	2025-08-20 00:47:45 +00:00
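The difference between returning `unittest.skip(...)` and raising `unittest.SkipTest` can be shown with a small self-contained example. The test class and the `run_demo` helper are hypothetical and unrelated to `test_matmul_cuda`; only the standard `unittest` API is used.

```python
import unittest


class SkipDemo(unittest.TestCase):
    def test_returns_skip(self):
        # `unittest.skip(...)` only builds a decorator. Returning it from the
        # test body skips nothing, so the runner does not report a skip here.
        return unittest.skip("hardware not supported")

    def test_raises_skip(self):
        # Raising SkipTest is what marks the running test as skipped.
        raise unittest.SkipTest("hardware not supported")


def run_demo() -> unittest.TestResult:
    # Run both tests and return the collected result for inspection.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SkipDemo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Inspecting `run_demo().skipped` shows that only `test_raises_skip` is reported as skipped.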
Anshul Sinha	a3a82e3da8	[FSDP][Replicate] replicate tests for param registration and input device movements (#160147 ) Summary: In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. To this end, I have added three test cases, one to test input device movement and the other two to test parameter registration during the forward and backward pass of a model. Test Cases 1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device 2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams Pull Request resolved: https://github.com/pytorch/pytorch/pull/160147 Approved by: https://github.com/weifengpy ghstack dependencies: #160135, #160136	2025-08-20 00:47:00 +00:00
Ke Wen	9d7cecdd6c	[SymmMem] Support rendezvous on view of a tensor (#160925 ) `tensor.view` shares the same `data_ptr()` as the original tensor, and thus cannot serve as the key to rendezvous' map (we want a 1:1 match between handle and tensor, so we need a unique key). @ezyang suggests using the raw `TensorImpl` of a tensor, for which `tensor.view` would have a different value than the original tensor. But the same raw `TensorImpl` address can come up again when a previous tensor gets deallocated and a new one is allocated. For that reason, we'd also need to use a `weak_intrusive_ptr` to distinguish the two tensors, i.e. for the deallocated tensor, `weak_intrusive_ptr::expired()` would return true. Added `test_rendezvous_view` and `test_rendezvous_same`. Note: the view support has been added to the NVSHMEM backend and the NCCL backend. For the CUDA backend, I have yet to investigate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160925 Approved by: https://github.com/ngimel ghstack dependencies: #160825	2025-08-19 23:49:25 +00:00
Natalia Gimelshein	0d19541284	fabric detection - fix build on an old toolkit (#160984 ) Fixes #160960 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160984 Approved by: https://github.com/eqy	2025-08-19 23:43:36 +00:00
eqy	e836323a23	[FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on `sm100` (#160693 ) See also: https://docs.nvidia.com/cuda/cublas/#id93 Only tensor-wide scales and 1D scales with tiled layout are supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160693 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2025-08-19 23:22:51 +00:00
Colin Peppler	512fc768e9	Add tlparse artifact for joint graph passes (for inference & non-freezing only) (#160589 ) Summary: Joint graph passes run several FX passes which can modify the graph before it hits Inductor. There are three uses of joint graph passes: - for inference & not freezing (we add structured logging only for this) - for inference & freezing - for fw/bw split Rollback Plan: Reviewed By: yushangdi Differential Revision: D80130321 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160589 Approved by: https://github.com/yushangdi	2025-08-19 23:18:40 +00:00
Xilun Wu	a7b5955ea8	[ContextParallel] add Document Masking test (#160700 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #160700 Summary add test case to CP + FlexAttention for Document Masking Test `pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention_document_mask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160700 Approved by: https://github.com/fegin	2025-08-19 23:03:18 +00:00
PyTorch MergeBot	e83825f91c	Revert "handling special case for pow(3) for GPU (#157537 )" This reverts commit 05e8fac4f374c4dbf0cd0e85e925e9112cf234a2. Reverted https://github.com/pytorch/pytorch/pull/157537 on behalf of https://github.com/malfet due to This is really really bad from performance point of view, wonder if any benchmarks will detect that ([comment](https://github.com/pytorch/pytorch/pull/157537#issuecomment-3202661810))	2025-08-19 22:57:45 +00:00
Pian Pawakapan	33c3794533	[dynamic shapes] use prims_common contiguity in create_example_tensors (#160933 ) Summary: forward fix T234739699 Test Plan: T234739699 Rollback Plan: Differential Revision: D80503451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160933 Approved by: https://github.com/henrylhtsang	2025-08-19 22:43:13 +00:00
Jane Xu	8f766d6839	Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557 ) TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes. This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only in abstract pass around obfuscated dtypes appearing as int32_ts, now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear. Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t). I then changed the test to test the scalar_type API. This code change required some refactoring because of circular dependencies. ## BC Breaking note This commit is (narrowly) BC-breaking for unpopular dtypes: `quint`s, `qint`s, `Bits`, `dummy_uint`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2025-08-19 22:13:47 +00:00
Raman Kumar	05e8fac4f3	handling special case for pow(3) for GPU (#157537 ) follows #152373 Special case for pow(3): Similar to the [CPU kernel](`d27d36136c/aten/src/ATen/native/cpu/PowKernel.cpp (L64)`), added corresponding GPU code for numerical stability. issue #150951 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157537 Approved by: https://github.com/soulitzer	2025-08-19 21:57:08 +00:00
Zhengxu Chen	f90ccad165	[export] Relax FC requirement of serde.deserialize by allowing unknown fields. (#160918 ) Summary: Previously we passed all serialized data to the dataclass constructors. Now we loop over the fields that exist in the dataclass and fetch only those needed to run the constructor. This helps when deserializing a buffer that contains a new field. Test Plan: CI Rollback Plan: Differential Revision: D80487716 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160918 Approved by: https://github.com/angelayi	2025-08-19 21:54:46 +00:00
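The forward-compatibility idea in the commit above (construct the dataclass from only the fields it declares, ignoring anything else in the payload) can be sketched as follows. The `Node` type and the `from_dict` helper are hypothetical and are not part of `torch.export`'s serde code.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class Node:
    # Hypothetical schema type used only for this illustration.
    name: str
    target: str


def from_dict(cls: type, data: Dict[str, Any]) -> Any:
    # Loop over the fields the dataclass declares and pick only those from the
    # payload. Keys the reader does not know about are silently ignored.
    kwargs = {f.name: data[f.name] for f in fields(cls) if f.name in data}
    return cls(**kwargs)
```

With this approach, a payload written by a newer producer, such as `{"name": "n", "target": "t", "new_field": 1}`, still deserializes on an older reader, whereas `Node(**payload)` would raise a `TypeError` for the unexpected keyword.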
Rob Timpe	35e4d97e04	[dynamo] Support builtin complex with constant args (#160799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160799 Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos	2025-08-19 20:38:54 +00:00
Jazlyn Li	66166cf1e7	preserve node meta to fix inductor generated kernel name for pattern matched graphs (#160542 ) Summary: When using inductor pattern matcher to replace graphs, the graph generated by replacement function can be missing `original_aten` metadata for the replaced nodes. This further results in inductor failing to generate a sensible kernel name, eg. `tri_poi_fused_0` , missing the aten op name. This diff attempts to fix that by allowing tracing the graph in replacement function with `preserve_node_meta`. Included this as an option to turn on in `pattern_matcher.fwd_only` function. Can confirm that with the fix, MTIA's pattern matcher replaced original graph with a node that has original_aten meta, and inductor generated kernel name has op name. Test Plan: added kernel_name check to afg_inductor_test silu test Rollback Plan: Differential Revision: D80183670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160542 Approved by: https://github.com/eellison, https://github.com/bdhirsh	2025-08-19 20:32:17 +00:00
PyTorch MergeBot	eba20d2d74	Revert "[WIP] Merge Test (#160998 )" This reverts commit ef761c43538abae5bccc0c4b6ebaf42ff676db7a. Reverted https://github.com/pytorch/pytorch/pull/160998 on behalf of https://github.com/ZainRizvi due to Undoing test merge ([comment](https://github.com/pytorch/pytorch/pull/160998#issuecomment-3202125839))	2025-08-19 20:30:39 +00:00
John Stawinski	ef761c4353	[WIP] Merge Test (#160998 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160998 Approved by: https://github.com/ZainRizvi	2025-08-19 20:26:07 +00:00
Will Constable	1ea918caf9	[C10D] Make MultiProcContinuousTest less spammy (#160821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160821 Approved by: https://github.com/fduwjj ghstack dependencies: #160892	2025-08-19 20:17:19 +00:00
Will Constable	779fc29c04	[C10D] Fix spelling of MultiProcContinuousTest (#160892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892 Approved by: https://github.com/fduwjj	2025-08-19 20:17:19 +00:00
Aaron Gokaslan	ed8bcccf31	[BE][Ez]: Update ruff to 0.12.9 (#160896 ) Updates ruff. Fixes false positives and other miscellaneous ruff linting and formatting fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160896 Approved by: https://github.com/zou3519	2025-08-19 19:56:24 +00:00
Ke Wen	9d9cc9897a	[SymmMem] Support rendezvous on slice of a tensor (#160825 ) When we search for a NVSHMEM allocation backing a tensor, don't limit it to an exact match between `tensor.data_ptr()` and `allocation.base_ptr`. Instead, test whether the former is within an allocation range, i.e. [base_ptr, base_ptr + size). This PR also squashed in original base PR #160795: Since (i) `handle = rendezvous(tensor)`, and (ii) we pass `handle->buffer_ptrs` to kernels, `handle` should carry the `data_ptr()` of tensor instead of the base address of a memory allocation (previous case). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160825 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-08-19 19:08:45 +00:00
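The lookup change in the commit above, from an exact base-pointer match to a range test over `[base_ptr, base_ptr + size)`, can be sketched as follows. The `Allocation` record and `find_allocation` helper are hypothetical and use plain integers in place of device pointers; they are not the NVSHMEM backend's actual data structures.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Allocation:
    # Hypothetical record of one symmetric-memory allocation.
    base_ptr: int
    size: int


def find_allocation(data_ptr: int, allocations: Iterable[Allocation]) -> Optional[Allocation]:
    # A slice of a tensor starts somewhere inside the allocation, so test for
    # membership in the half-open range instead of equality with base_ptr.
    for alloc in allocations:
        if alloc.base_ptr <= data_ptr < alloc.base_ptr + alloc.size:
            return alloc
    return None
```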
Markus Hoehnerbach	65d21dae18	[inductor] dont reuse buffers if it affects peak (#145883 ) (#159530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530 Approved by: https://github.com/eellison	2025-08-19 19:02:56 +00:00
atalman	62db8ec391	windows python 3.14 nightly builds (#159869 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159869 Approved by: https://github.com/malfet, https://github.com/williamwen42	2025-08-19 18:36:16 +00:00
Mengtian Xu	5dad5b4f57	[AIDIR] Revise the insight content (#160649 ) Summary: Make it more descriptive and understandable to the user. Rollback Plan: Differential Revision: D80218659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160649 Approved by: https://github.com/jingsh	2025-08-19 18:04:49 +00:00
Huy Do	fab5dac734	Tweak dependabot to run inductor jobs (#160935 ) After https://github.com/pytorch/pytorch/pull/160635, I can see dependabot creating the PR to bump the `transformers` version at https://github.com/pytorch/pytorch/pull/160807. This is a good start, but there are several tweaks we need: 1. Run inductor tests on the PR, including one round of perf benchmark, which is always needed. So, we need the `ciflow/inductor` label and a `pull_request` trigger for the benchmark. 2. Per @anijain2305's feedback, we don't need to update patch versions. So, I add a rule to ignore them. Again, we would need to test this out after this lands. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160935 Approved by: https://github.com/anijain2305	2025-08-19 17:56:07 +00:00
Nikita Shulga	a44a0d3671	[MPS] Fix index_add for complex + int64 (#160926 ) By re-using deterministic algorithm from `bbc7c03e93/aten/src/ATen/native/cuda/Indexing.cu (L1106-L1113)` Fixes https://github.com/pytorch/pytorch/issues/160845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160926 Approved by: https://github.com/manuelcandales ghstack dependencies: #160850, #160889	2025-08-19 17:43:06 +00:00
Pian Pawakapan	2f0cba934d	[dynamic shapes] unbacked-safe slicing (#157944 ) Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944 Approved by: https://github.com/laithsakka	2025-08-19 17:32:47 +00:00
Sam Anklesaria	0a5ab612dd	Port amax to stable ABI (#160214 ) To enable porting torchaudio to the stable ABI, we need the `amax` operation to be accessible. This PR ports the op and provides tests that it behaves correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160214 Approved by: https://github.com/mikaylagawarecki	2025-08-19 17:24:53 +00:00
Jeff Daily	1fbe230b0d	forward fix #160747 (#160981 ) broke rocm inductor tests Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/160981 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007 Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-19 17:16:41 +00:00
PyTorch MergeBot	eddaaa6c2a	Revert "Recheck Autotune cache on Precompile serialization to prune compilation results (#158656 )" This reverts commit 664005662ad8c9aa1942015397048aa9ca14fd6d. Reverted https://github.com/pytorch/pytorch/pull/158656 on behalf of https://github.com/seemethere due to failing internal tests, see D80486843 ([comment](https://github.com/pytorch/pytorch/pull/158656#issuecomment-3201491561))	2025-08-19 16:53:20 +00:00
Richard Barnes	fecc5f6001	[codemod] Fix unused-local-typedef issue in caffe2/aten/src/ATen/native/cuda/CUDALoops.cuh +2 (#160944 ) Summary: LLVM has a warning `-Wunused-local-typedef` which we are enabling to remove unused code. This has the side-effect of making it easier to do refactors such as removing unnecessary includes. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Rollback Plan: Differential Revision: D80511128 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160944 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-08-19 16:49:29 +00:00
Isuru Fernando	f305019377	[inductor] propagate shapes in CSEVariable (#152198 ) Fixes #149905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198 Approved by: https://github.com/eellison	2025-08-19 16:46:38 +00:00
Tialo	50cfe76231	Update checkpoint warning to target PyTorch 2.9 (#160725 ) Follow-up to #160534. Fixes the docstrings and the warning in checkpoint_sequential, which presumably should have the same deprecation notice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160725 Approved by: https://github.com/soulitzer	2025-08-19 15:08:50 +00:00
James Wu	9225c61994	Move save guard error throwing to separate phase (#160662 ) This diff completely separates the portion of guard saving that can throw from GuardBuilder and moves it into `serialize_guards`. This lets me add a try/catch around it for caching precompile later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160662 Approved by: https://github.com/zhxchen17	2025-08-19 14:46:43 +00:00
PyTorch MergeBot	e3ebf364e6	Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836 )" This reverts commit 5d9653d90ee003173dd03f93e09fed236500ef06. Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/malfet due to It broke inductor tests by improving them ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3200834103))	2025-08-19 13:46:53 +00:00
FFFrog	284b719005	Remove the unnecessary empty file (#160728 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728 Approved by: https://github.com/Skylion007	2025-08-19 10:54:08 +00:00
FFFrog	daeb3a6094	Using std::make_unique<T>() instead of unique<T>(new T()) (#160723 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160723 Approved by: https://github.com/Skylion007	2025-08-19 10:25:47 +00:00
cyy	5d9653d90e	Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836 ) Because numpy 1.22.4 had reached EOL 3 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836 Approved by: https://github.com/malfet	2025-08-19 09:15:06 +00:00
Nick Riasanovsky	df60736410	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-19 07:32:55 +00:00
thenumberouscode	8f31aa97a3	[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934 ) Fixes #157399 cherry pick of d6a5c03 @mlazos Pull Request resolved: https://github.com/pytorch/pytorch/pull/160934 Approved by: https://github.com/mlazos	2025-08-19 06:01:26 +00:00
Nikita Shulga	29afde2020	[CD] Build libtorch without nvshmem (#160910 ) It was done once for cuSparseLT in `f01d7105b1`, now it's nvShmem's time Fixes https://github.com/pytorch/pytorch/issues/160762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160910 Approved by: https://github.com/Skylion007	2025-08-19 05:58:25 +00:00
David Berard	8dbe7f99bd	[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711 ) allow_tf32 is deprecated. Also, this will make it easier to support tf32x3 (i.e. #160359). dashboard results on h100 show no change: [inference](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f), [training](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160711 Approved by: https://github.com/PaulZhang12, https://github.com/njriasan	2025-08-19 05:27:10 +00:00
PyTorch UpdateBot	1d46aa736f	[audio hash update] update the pinned audio hash (#160930 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160930 Approved by: https://github.com/pytorchbot	2025-08-19 04:22:55 +00:00
PyTorch UpdateBot	2cf69fe0e1	[vllm hash update] update the pinned vllm hash (#160929 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160929 Approved by: https://github.com/pytorchbot	2025-08-19 04:22:45 +00:00
dolpm	923bc46122	fix mul.Scalar with strided tensor (#160560 ) Summary: out variant has to be strided like self. since memory format isn't provided, this should be equivalent. Test Plan: prev. when we enable static dispatch this test would have numeric issues ``` buck2 test //caffe2/test:test_export -- test__scaled_dot_product_flash_attention_cpp_runtime_nonstrict --print-passing-details ``` Rollback Plan: Reviewed By: SherlockNoMad Differential Revision: D80191085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160560 Approved by: https://github.com/SherlockNoMad	2025-08-19 04:15:12 +00:00
Paul de Supinski	58f9a3dd63	[ez] Only use default numa bindings if nproc == cuda device count (#160848 ) # Context Another fix to enable broad rollout of #149334. The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality. # This PR As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848 Approved by: https://github.com/kiukchung	2025-08-19 02:50:01 +00:00
Will Feng	a391fa1c42	Make Inductor benchmarker more compatible with Triton do_bench (#160921 ) Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair for all backends. E.g. it includes torch.compile Dynamo invocation overhead and hence doesn't reflect the real-world model use case where Dynamo overhead is usually hidden. I also opened a PR to use this timing measurement function on TritonBench side: https://github.com/meta-pytorch/tritonbench/pull/333. But regardless of whether that PR can land, I think we should enhance Inductor benchmark_gpu to match do_bench features, to make it easier for people to migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160921 Approved by: https://github.com/BoyuanFeng	2025-08-19 02:40:21 +00:00
Yidi Wu	209143ddeb	[while_loop][inductor] fix aliased inputs by cloning (#160668 ) [fx_graph_cse](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/compile_utils.py#L46) is executed in the min_cut partitioner, which accidentally creates aliasing for empty buffers; we can see the following graph node in the joint graph with cmd: "pytest test/functorch/test_control_flow.py -k test_scan_multiple_layers_gradient_layers_2_device_cpu" ```python while_loop = torch.ops.higher_order.while_loop(while_loop_cond_graph_0_0, while_loop_body_graph_0_0, (full_default_4, empty_strided_default, full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default, rev, rev_1, rev_2, rev_3), (primals_4, primals_5, primals_6, primals_7)); ``` Notice the operand sequence "full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default", which indicates that the gradients of different layers now share the same buffer, which creates silent incorrectness. Fixes https://github.com/pytorch/pytorch/pull/158168. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160668 Approved by: https://github.com/zou3519 ghstack dependencies: #160548, #160374	2025-08-19 02:33:59 +00:00
Wang, Chuanqi	b1380f434d	[CD] Disable USE_MPI in XPU CI/CD wheel build (#159135 ) The XPU wheel build needs to source MPI for the distributed XCCL backend build, but this also enables USE_MPI by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159135 Approved by: https://github.com/malfet	2025-08-19 02:32:03 +00:00
mori360	e6e45e6ae8	[FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload (#160481 ) Fixes https://github.com/pytorch/pytorch/issues/160291 `post_reduce_stream` is `all_reduce_stream` during HSDP, but the CPU-GPU sync is hard-coded to `reduce_scatter_stream`. The hard-coded stream could fail the unit test on HSDP + CPU offload; this PR adds a unit test for that case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160481 Approved by: https://github.com/weifengpy	2025-08-19 02:20:14 +00:00
Anshul Sinha	3d126e17e0	[FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136 ) Summary: In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. I have also added a test command. Test Cases 1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1 2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading Pull Request resolved: https://github.com/pytorch/pytorch/pull/160136 Approved by: https://github.com/weifengpy ghstack dependencies: #160135	2025-08-19 02:13:30 +00:00
Kevin Fu	8d15af2320	[PT2]: Allow None for wrapped_fbgemm_linear_fp16_weight (#160802 ) Summary: Currently the implementation of [fbgemm_linear_fp16_weight](https://www.internalfb.com/code/fbsource/[ffe8ba561cb6af33fde5b32c27411d6d3f4f2c70]/fbcode/caffe2/aten/src/ATen/native/QuantizedLinear.cpp?lines=477) does not allow None for `bias`, but it's actually a valid case, and internally `fbgemm_linear_fp16_weight_fp32_activation` accepts a None bias as well. For BC reasons, we can't directly change the function signature, so we wrap an empty tensor when bias is None to work around it in Sigmoid. Test Plan: P1906210273 ``` MODEL_TYPE=dpa_product_first_ctr_model MODEL_ENTITY_ID=778442870 SNAPSHOT_ID=6 MODULE=user SUFFIX=.predictor.precompute.remote_request_only buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=10000 &> ~/logs/${MODEL_TYPE}/load_net_predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}_${MODULE} ``` Rollback Plan: Reviewed By: henryoier, hl475 Differential Revision: D80382652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160802 Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier	2025-08-19 01:46:53 +00:00
zhxchen17	e9209e0854	[dynamo] Refactor tracer logic in convert_frame so that it doesn't leak to outer layer. [1/n] (#160814 ) We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export). One incremental step we can take is to refactor out InstructionTranslator as a functional piece providing bytecode tracing. To separate out this part, we notice that the tracer object is currently passed around in the entire convert frame compile function. This is not ideal because we want to build a boundary between the tracing and downstream compiler stack. Ideally, we should extract all the relevant information out of the tracer object and return a new data structure that is free of internal states of InstructionTranslator. Luckily, there isn't much data used from the tracer after tracing is finished. The major one is OutputGraph; other than that, we only need to record two boolean flags for error handling purposes. The new type we're adding is called DynamoTracerOutput, which contains all the information needed by torch.compile internals after symbolic convert is finished. To simplify the current PR, we leave out the part that reduces OutputGraph into a minimal set, since this can be done in a separate PR. Differential Revision: [D80388693](https://our.internmc.facebook.com/intern/diff/D80388693/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160814 Approved by: https://github.com/tugsbayasgalan	2025-08-19 01:46:24 +00:00
Pian Pawakapan	4cb31015f2	[dynamic shapes] prims_common non_overlapping_and_dense (#160462 ) Differential Revision: D80120333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160462 Approved by: https://github.com/laithsakka	2025-08-19 01:35:28 +00:00
PyTorch MergeBot	5e98d9f9ba	Revert "[dynamic shapes] unbacked-safe slicing (#157944 )" This reverts commit 56218d85e2da09d9ede3809718ec989c2151632c. Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think this is failing test_draft_export in trunk `56218d85e2` ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3198874677))	2025-08-19 01:16:17 +00:00
Michael Lazos	5cf6567c1f	[Inductor] add cuda compile cmd to autotuning logging (#160906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160906 Approved by: https://github.com/henrylhtsang	2025-08-19 01:14:46 +00:00
Shangdi Yu	41b3e80a55	Fix duplicated kernel name in kernel stack trace tracking (#160905 ) Summary: as title. When we have two kernels with the same name, the stack traces should be appended, not overwritten. Test Plan: ``` buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing ``` Rollback Plan: Differential Revision: D80472731 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160905 Approved by: https://github.com/angelayi	2025-08-19 01:14:34 +00:00
Ting Lu	b6852778ff	Add Magma build for CUDA 13.0 (#160770 ) Add magma build for CUDA 13.0 after almalinux docker is available https://github.com/pytorch/pytorch/issues/159779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160770 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com> Co-authored-by: Wei Wang <weiwan@nvidia.com>	2025-08-19 01:10:00 +00:00
xinan.lin	1853f71b4f	[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403 ) Fixes #160243, Fixes #160244, Fixes #160245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403 Approved by: https://github.com/janeyx99	2025-08-19 00:54:51 +00:00
Lakshay Garg	bbc7c03e93	Fix UndefinedGrad::apply (#160572 ) The function incorrectly reserved space in the input parameter instead of the output parameter Pull Request resolved: https://github.com/pytorch/pytorch/pull/160572 Approved by: https://github.com/soulitzer	2025-08-19 00:15:51 +00:00
Justin Chu	dc200066cf	[ONNX] Use onnxruntime 1.22 in CI (#160924 ) Use onnxruntime 1.22 in CI to enable testing of newer opsets and IR versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160924 Approved by: https://github.com/titaiwangms	2025-08-19 00:05:26 +00:00
Pian Pawakapan	56218d85e2	[dynamic shapes] unbacked-safe slicing (#157944 ) Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944 Approved by: https://github.com/laithsakka	2025-08-18 22:38:16 +00:00
Natalia Gimelshein	0254646654	harden fabric checks for symmetric memory (#160790 ) Now we check only that fabric allocation succeeded, but sometimes we fail during export or import afterwards, with no recourse. Check the full cycle before attempting to allocate memory with the fabric. TODO: move it to c10/cuda so that it can be used from CUDACachingAllocator too Pull Request resolved: https://github.com/pytorch/pytorch/pull/160790 Approved by: https://github.com/Skylion007	2025-08-18 22:35:50 +00:00
dolpm	b439675ae2	[nativert] oss pass graph pass registration (#160859 ) Summary: att Test Plan: CI Rollback Plan: Differential Revision: D80368343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160859 Approved by: https://github.com/georgiaphillips	2025-08-18 22:23:38 +00:00
PyTorch MergeBot	82c7a1eb4b	Revert "[ONNX] Default to dynamo export (#159646 )" This reverts commit 11b6ceb7b4f81ba02f88652136a93d685c399191. Reverted https://github.com/pytorch/pytorch/pull/159646 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159646#issuecomment-3198507767))	2025-08-18 21:41:32 +00:00
Wei Wang	16ada80c61	[BE][CUDA][Distributed] Add require_exact_world_size() and a few distributed unit test fixes (#160803 ) 1. Add require_exact_world_size() 2. Decorate the test `test_new_subgroups_with_group_param` with this require_exact_world_size(4) as the test would fail with world_size of 8 when testing with 8xB200 runner. 3. Modify `test_new_subgroups_world_size_not_divisible_by_group_size` so that it will not fail due to 4 vs. 8 mismatch. Doing so makes the test pass with both 4-GPU runner and 8-GPU runner. Separating these changes out from B200 distributed runner PR #159323 Fixes https://github.com/pytorch/pytorch/issues/159987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160803 Approved by: https://github.com/fduwjj	2025-08-18 21:15:33 +00:00
Klaus Zimmermann	c27d6df1ea	For sdists, replace symlink with copy for docs requirements (#157811 ) Before this change, there was the requirements file `.ci/docker/requirements-docs.txt` which was symlinked as `../.ci/docker/requirements-docs.txt` from `docs/requirements.txt` since #151796. In this situation, [because `.ci` is excluded from the source tarball](`3173616532/.github/workflows/create_release.yml (L67)`), we end up with a broken symlink, that additionally is [invalid in a Python source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/#unpacking-without-the-data-filter). The broken symlink can be confirmed in [the rc sources](https://github.com/pytorch/pytorch/actions/runs/15892205745). ~After this change, there is still a single source of truth, which now is `docs/requirements.txt`, symlinked as `../docs/requirements.txt` from `.ci/docker/requirements-docs.txt`, which would also be invalid in a Python source distribution, but is not included in the tarball (see above). Additionally, the docs requirements that were missing from the previous tarball, are now actually included, allowing users to build the documentation again.~ @malfet clarified offline that there is a problem with the docs workflows because they use a cache with a key that includes the hash of the requirements document in the `.ci` folder, which now no longer changes when the requirements change. Hence, a different solution is needed~, though for now the problem remains~. The solution in this PR is simply to copy the actual document to replace the symlink just prior to creating the source distribution. This way, a single document needs to be maintained, git checkouts remain as they are, and the source distributions contain the before-missing document. A better solution may be implemented at a later stage with a better build system.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157811 Approved by: https://github.com/atalman	2025-08-18 21:10:44 +00:00
Mitchell, Frost	d910cb3b2d	[cpp][inductor] Fix crash on bmm when input is used twice. (#160087 ) Fixes #156412 For torch.bmm using CPP generated template code, when the input is used as both the first and second weights, the generated code will simplify so it only passes one input instead of 2. However, if the weights are being repacked and saved for more efficient data-loading patterns, then we need to save both inputs instead of just one. This PR fixes this issue. ## Test code: ```python import torch @torch.compile(mode="max-autotune") def my_function(x, y): return torch.bmm(x, x) # Test x = torch.randn(2, 3, 3) y = torch.randn(2, 3, 3) result = my_function(x, y) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160087 Approved by: https://github.com/guangyey, https://github.com/jansel	2025-08-18 20:34:14 +00:00
Ryan Guo	a1a555ed7b	[dynamo] Fix graph break on calling functions decorated with special context manager (#160703 ) As title. This is a follow-up of the previous patch, with the goal of supporting a new pattern that showed up in ComfyUI: `644b23ac0b/comfy/ops.py (L44)` Effectively, the semantics of calling a function decorated with a context manager is: ```python @ctx_manager(args) def f(x): ... f(x) # -----> with ctx_manager(args): f.__wrapped__(x) ``` Yes, a fresh context manager instance per invocation, see CPython source code: https://github.com/python/cpython/blob/3.12/Lib/contextlib.py#L119-L122 So Dynamo already 1. knows how to handle the `with ctx_manager(args)` syntax, and has special handling for a few torch native context managers, like `sdpa_kernel` in this patch. 2. can trace through a good chunk (at least the ones that matter in this case) of contextlib. This patch just lets Dynamo trace a bit more into contextlib, and then keeps the torch-native special cases by moving their handling a bit down the stack, so that no additional logic is introduced -- it's only refactored. This also allows us to get rid of some `_sdpa_kernel_variadic` special handling, since now we will trace through its code, and it boils down to `sdpa_kernel` anyways. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160703 Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos ghstack dependencies: #160684	2025-08-18 20:33:45 +00:00
Ryan Guo	72b559b2c8	[dynamo] Fix crash and silent incorrectness issues in `attention.sdpa_kernel` calls with kwargs (#160684 ) This patch fixes 2 issues, illustrated by the test cases added: 1. a crash when using `sdpa_kernel(backends=..., set_priority=...)`, due to an internal assert that was not updated after #147768. 2. forgetting to convert the `set_priority` VariableTracker back to a python constant so that its value is properly used by `sdpa_kernel`, also from #147768. I ran into (1) because ComfyUI had a recent update that actually uses this pattern `644b23ac0b/comfy/ops.py (L44)`, and then noticed (2), and fixed it conveniently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160684 Approved by: https://github.com/mlazos	2025-08-18 20:33:45 +00:00
cyy	1f19003694	Use py3.10 for ONNX CI jobs (#160852 ) Use Python 3.10 for ONNX jobs because Python 3.9 is near EOL and further ONNX versions drop 3.9 support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160852 Approved by: https://github.com/justinchuby, https://github.com/malfet	2025-08-18 19:37:47 +00:00
Shangdi Yu	4e90441133	Add signpost to provenance tracking error (#160755 ) Summary: As title, add signpost to better track error when computing provenance tracking related debugging information Test Plan: CI Rollback Plan: Differential Revision: D80292285 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160755 Approved by: https://github.com/angelayi	2025-08-18 19:17:47 +00:00
Xinya Zhang	bfcae7e1c1	[ROCm] Fix Sliding Window Attention in AOTriton integration code (#159773 ) AOTriton implements Sliding Window Attention (SWA) as a more generalized version of causal masks and also needs an atomic counter for dynamic workload allocation. Fixes #158308 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159773 Approved by: https://github.com/jeffdaily	2025-08-18 18:45:58 +00:00
Michael Lazos	01bba62e21	Remove unused test code (#160823 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160823 Approved by: https://github.com/Skylion007	2025-08-18 18:37:52 +00:00
angelayi	6ac9035a84	[aoti-fx] Dynamic shapes support (#160766 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160766 Approved by: https://github.com/jansel ghstack dependencies: #160765	2025-08-18 18:14:08 +00:00
angelayi	bab79824cb	[aoti-fx] Initial AOTInductor FX (#160765 ) Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module. How to use: ```python exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module() compiled_gm = torch._inductor.aot_compile( exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1} ) assert torch.allclose(model(inp), compiled_gm(inp)) ``` The motivation behind this is that backends like ExecuTorch/MTIA would like to use inductor's optimization technologies, but might have their own graph lowering pipelines so they might not want to use AOTI (which generates an so). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765 Approved by: https://github.com/jansel	2025-08-18 18:14:08 +00:00
Rob Timpe	162bf78df6	[dynamo] Support itertools.filterfalse (#160596 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160596 Approved by: https://github.com/guilhermeleobas	2025-08-18 18:07:57 +00:00
Michael Lazos	450517f346	[Dynamo][Hierarchical Compile] Flatten tuple inputs for regions (#158812 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158812 Approved by: https://github.com/anijain2305 ghstack dependencies: #158810, #158811	2025-08-18 18:03:11 +00:00
James Wu	664005662a	Recheck Autotune cache on Precompile serialization to prune compilation results (#158656 ) This PR rechecks the autotune cache on Precompile.serialize(), allowing us to ahead of time save autotune results for statically compiled triton kernels, so that warm start does not need to check the autotune cache. It has a few extra changes to make this work: ### Storing source code in TritonBundler - We now store the source_code for statically compiled triton kernels instead of the hash of the source code in TritonBundler, so that we can easily access their source code when rechecking the autotune cache on PrecompileContext.serialize. To make sure that this is not a huge space concern, I ran the entire hugging face benchmark on training. The total space of `/tmp/torchinductor_jjwu/fxgraph` before my change was 1185004 KB (1.18 GB). After my change, this increased to 1207312 KB (1.2 GB), for an increased storage cost of ~1.8%, which seems safe. - We now return early from recheck_autotune_cache if the number of triton kernels being compiled is 1, since there's no reason to check the cache at all in those cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158656 Approved by: https://github.com/zhxchen17	2025-08-18 17:55:10 +00:00
Sam Anklesaria	c0a1ae4404	Add `is_cpu` method to stable tensor type (#160212 ) Porting torchaudio to use the stable api requires the `is_cuda` and `dtype` functions. It would be more convenient if these were methods of the stable tensor class rather than utilities one needed to call from the C api. This PR adds them as methods, mirroring how `is_cuda` and `get_device` are already defined. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160212 Approved by: https://github.com/janeyx99	2025-08-18 17:42:43 +00:00
Nikita Shulga	b0071c65e2	[MPS] Fix error check for torch.var on scalar (#160889 ) Fixes https://github.com/pytorch/pytorch/issues/160738 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160889 Approved by: https://github.com/Skylion007 ghstack dependencies: #160850	2025-08-18 17:36:42 +00:00
Guilherme Leobas	c6333f7dae	Fixes for `collections.NamedTuple` (#159367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159367 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864, #159865	2025-08-18 17:32:59 +00:00
Ting Lu	87d6831b2e	Add CUDA installation script for CUDA 13 (#160201 ) Add the almalinux docker for building magma-cuda 13.0 https://github.com/pytorch/pytorch/issues/159779 Also fixed the NVSHMEM download link Pull Request resolved: https://github.com/pytorch/pytorch/pull/160201 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <atalman@fb.com>	2025-08-18 17:26:25 +00:00
James Wu	4014672b30	Replace guard_serialization_mode with save_guards, remove load cases (#160531 ) This PR replaces "guard_serialization_mode" with `save_guards`. All cases where we care about whether or not we're loading guards can be inferred automatically from the existing inputs. The only case that's special here is whether or not to check guards. We don't want to check guards on guard load in CheckFnManager, because these guards have already been checked on save. Therefore, we put the setting in OutputGraphGuardsState, so that when we save, we bypass the guards check. Because of this change, it is technically possible to do a load and a save in the same CheckFunctionManager.__init__() by passing all the necessary parts, and also passing `save_guards=True`. This should just work out of the box, but so far no callsites need it, so not super important. Next up, we'll work on removing save_guards from GuardBuilder, and putting it into its own phase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160531 Approved by: https://github.com/zhxchen17	2025-08-18 17:04:17 +00:00
Peter Y. Yeh	e389a08dcd	AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360 ) - This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD ROCm 7.0 and the gfx950 architecture. This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform. - Validation (ROCm 7.0 + gfx950 required): `111 relevant tests passing.` > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v Co-author: @jagadish-amd — Thank you for the efforts leading validation on gfx950 with ROCm 7.0. ----------------------------------- This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types. ### Support for new scalar types and scaling methods: * [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934) * [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. 
[[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364) ### Updates to scalar type mappings: * [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0. * [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version. ### Enhancements to testing (@jagadish-amd): * [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356) These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360 Approved by: https://github.com/drisspg, https://github.com/malfet	2025-08-18 16:43:09 +00:00
Animesh Jain	f2be3dc8da	[dynamo][guards] Optimize module getattr access for inline flag (#160864 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160864 Approved by: https://github.com/Lucaskabela ghstack dependencies: #160863	2025-08-18 16:38:46 +00:00
Animesh Jain	b8ff0fd21b	[dynamo][guards] Remove long lines from TORCH_LOGS=guards (#160863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160863 Approved by: https://github.com/Lucaskabela	2025-08-18 16:38:46 +00:00
Nikita Shulga	6b994c47ca	[MPS][BE] Fix unused vars in GridSampler (#160850 ) This fixes following warnings during the compilation of GridSampler.metal ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:22:23: warning: unused parameter 'input_sizes' [-Wunused-parameter] constant int32_t* input_sizes, ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:24:23: warning: unused parameter 'grid_sizes' [-Wunused-parameter] constant int32_t* grid_sizes, ^ 2 warnings generated. ``` Introduced by https://github.com/pytorch/pytorch/pull/160541 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160850 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-08-18 16:24:45 +00:00
angelayi	3c8c509a9c	[export] Fix custom ops in subgraphs (#160004 ) Fixes https://github.com/pytorch/pytorch/issues/159995 Currently there are two problems with extern kernels in subgraphs: 1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph. 2. Since the scope of each extern_kernel list is within its own subgraph, the indices referencing the operators are messed up because each subgraph starts counting from 0. So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004 Approved by: https://github.com/ydwu4	2025-08-18 15:42:19 +00:00
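The index collision described in #160004 can be illustrated with a toy model. This is a minimal sketch, not Inductor's code: the graph names, the op names, and both helper functions are hypothetical and only show why per-subgraph lists restart at 0 while one shared list yields unique indices.

```python
# Toy model only: none of these names come from Inductor.

def number_per_subgraph(graphs):
    """Each graph keeps its own kernel list, so every graph starts counting from 0."""
    refs = {}
    for graph_name, ops in graphs.items():
        local_kernels = []
        for op in ops:
            local_kernels.append(op)
            refs[(graph_name, op)] = len(local_kernels) - 1
    return refs


def number_globally(graphs):
    """One shared kernel list across all graphs, so every index is unique."""
    shared_kernels = []
    refs = {}
    for graph_name, ops in graphs.items():
        for op in ops:
            shared_kernels.append(op)
            refs[(graph_name, op)] = len(shared_kernels) - 1
    return shared_kernels, refs


# Hypothetical graphs: a toplevel graph plus the two branches of a conditional.
graphs = {
    "toplevel": ["mylib::foo"],
    "true_branch": ["mylib::bar"],
    "false_branch": ["mylib::baz"],
}

local_refs = number_per_subgraph(graphs)
shared_kernels, global_refs = number_globally(graphs)

print(sorted(local_refs.values()))   # every subgraph reports index 0
print(sorted(global_refs.values()))  # indices are unique across graphs
```

With a shared list, the serialized kernel file also sees the ops from the subgraphs, which corresponds to problem 1 in the commit message.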
Angela Yi	1091165826	[export] Update move_to_device_pass for to.device (#160528 ) Differential Revision: D80135455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160528 Approved by: https://github.com/yushangdi	2025-08-18 15:41:48 +00:00
Scott Todd	d91a03f96a	[ROCm] Add HIPConfig.h to .gitignore like CUDAConfig.h. (#159805 ) This file is generated into the source directory by CMake just like `cuda/CUDAConfig.h`, so it seems appropriate to add it to `.gitignore` in the same place: `83ba3f1101/aten/src/ATen/CMakeLists.txt (L39-L47)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159805 Approved by: https://github.com/jeffdaily	2025-08-18 15:34:01 +00:00
Nichols A. Romero	0298ebc97a	[ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list (#160001 ) The GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds the model to an exception list for small tensors. The exception list already includes two models. For this model, the multiplier factor used in `torch/_dynamo/utils.py` becomes 10.0 instead of 3 (the default). In the PR comment below, I include a short analysis of the numerics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001 Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily	2025-08-18 15:33:30 +00:00
PyTorch UpdateBot	179511694c	Update slow tests (#160870 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160870 Approved by: https://github.com/pytorchbot	2025-08-18 11:53:41 +00:00
PyTorch UpdateBot	e7c3b77b22	[xla hash update] update the pinned xla hash (#160871 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160871 Approved by: https://github.com/pytorchbot	2025-08-18 11:50:47 +00:00
Sun, Jiayi	95e456fcc5	[inductor] pack linear for FP32 dynamic mode (#157542 ) Summary: Currently, Linear in FP32 dynamic mode (batch_size has free symbols) does not support weight prepacking, since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode. I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. About 18 models improve to varying degrees; among them, BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, and twins_pcpvt_base improve by more than 20%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542 Approved by: https://github.com/CaoE, https://github.com/jansel	2025-08-18 10:18:46 +00:00
Sun, Jiayi	de744ca4b1	[Inductor] modify convert_to_reinterpret_view (#158914 ) Summary: Fix https://github.com/pytorch/pytorch/issues/159121, Modify the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: relax the condition of `isinstance(x_unwrap_view, (ReinterpretView, Buffer))` to `isinstance(x_unwrap_view, (ReinterpretView, Buffer, MutableBox))`. Prefer channels last format according to how the format of `x_unwrap_view_fx_node` is set from eager. Example: ``` import torch import torch.nn as nn class M(nn.Module): def __init__(self): super(M, self).__init__() self.relu = torch.nn.ReLU() def forward(self, x): n, c, h, w = x.shape return self.relu(x).permute(0, 2, 3, 1).reshape( n, h * w, c ) model = M().eval() x = torch.randn(2, 32, 4, 4).to(memory_format=torch.channels_last) compiled_model = torch.compile(model) with torch.no_grad(): compiled_model(x) ``` Generated code: - before ``` cpp_fused_permute_relu_view_0 = async_compile.cpp_pybinding(['const float', 'float', 'float'], ''' #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const float in_ptr0, float* out_ptr0, float* out_ptr1) { { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(16L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(16L); x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(32L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(16L))) { alignas(std::max(std::size_t(16), alignof(float))) float tmp0[1616]; transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(in_ptr0 + static_cast<int64_t>(x1 + 32Lx2 + 512Lx0), static_cast<int64_t>(32L), tmp0, static_cast<int64_t>(16)); for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++) { auto tmp1 = 
at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16Lx1_inner), static_cast<int64_t>(16)); auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0)); tmp2.store(out_ptr0 + static_cast<int64_t>(x2 + 16Lx1 + 16Lx1_inner + 512Lx0)); } } } } } } } { #pragma GCC ivdep for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(16L)) { for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L))) { alignas(std::max(std::size_t(16), alignof(float))) float tmp0[1616]; transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(out_ptr0 + static_cast<int64_t>(x1 + 16Lx2 + 512Lx0), static_cast<int64_t>(16L), tmp0, static_cast<int64_t>(16)); for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++) { auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16Lx1_inner), static_cast<int64_t>(16)); tmp1.store(out_ptr1 + static_cast<int64_t>(x2 + 32Lx1 + 32Lx1_inner + 512Lx0)); } } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32)) buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 16, 4, 1), torch.float32) buf1 = empty_strided_cpu((2, 16, 32), (512, 32, 1), torch.float32) cpp_fused_permute_relu_view_0(arg0_1, buf0, buf1) del arg0_1 return (buf1, ) ``` - After ``` cpp_fused_relu_0 = async_compile.cpp_pybinding(['const float', 'float'], ''' #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1024L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= 
static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1024L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = at::vec::clamp_min(tmp0, decltype(tmp0)(0)); tmp1.store(out_ptr0 + static_cast<int64_t>(x0)); } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32)) buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 1, 128, 32), torch.float32) cpp_fused_relu_0(arg0_1, buf0) del arg0_1 return (reinterpret_tensor(buf0, (2, 16, 32), (512, 32, 1), 0), ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158914 Approved by: https://github.com/CaoE, https://github.com/jansel	2025-08-18 07:41:20 +00:00
PyTorch MergeBot	b82aa3df20	Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197 )" This reverts commit e444cd24d48b3a46f067974f2cc157f5ed27709f. Reverted https://github.com/pytorch/pytorch/pull/159197 on behalf of https://github.com/laithsakka due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/159197#issuecomment-3195436668))	2025-08-18 07:22:13 +00:00
zhaoguoan	d8d589bd3a	Add build support for RISCV (#160172 ) In requirements.txt, do not install lintrunner on riscv64 Fixes #160170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160172 Approved by: https://github.com/malfet	2025-08-18 05:29:34 +00:00
drisspg	3c6efd1380	Add cutedsl template support to compile (#160108 ) ## Summary Still figuring out what writing a template should actually look like, but this lands a lot of the base infra. Test code: ```Python #!/usr/bin/env python3 """ Fixed CuteDSL template test with proper def_kernel usage. """ import torch import torch._inductor.config as config from torch._inductor.lowering import lowerings from torch._inductor.ir import TensorBox from torch._inductor.select_algorithm import autotune_select_algorithm from torch._inductor.codegen.cutedsl import CuteDSLTemplate def create_fixed_cutedsl_template(): """Create a properly structured CuteDSL template.""" def cutedsl_grid(M, N, meta): return (1,) # Part 1: Imports and kernel definition template_part1 = r""" import torch import cutlass import cutlass.cute as cute from cutlass.cute.runtime import from_dlpack @cute.kernel def {{kernel_name}}_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor): # Get thread and block indices tidx, _, _ = cute.arch.thread_idx() bidx, _, _ = cute.arch.block_idx() bdim, _, _ = cute.arch.block_dim() thread_idx = bidx * bdim + tidx m, n = gA.shape if thread_idx < m * n: mi = thread_idx // n ni = thread_idx % n if mi < m and ni < n: a_val = gA[mi, ni] b_val = gB[mi, ni] result = a_val + b_val gC[mi, ni] = a_val + b_val """ # Part 2: JIT wrapper function template_part2 = r""" @cute.jit def {{kernel_name}}_jit(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor): m, n = mA.shape total_threads = m * n threads_per_block = 256 num_blocks = (total_threads + threads_per_block - 1) // threads_per_block kernel = {{kernel_name}}_kernel(mA, mB, mC) kernel.launch( grid=[num_blocks, 1, 1], block=[threads_per_block, 1, 1] ) """ # Part 3: Main kernel function template_part3 = r""" {{def_kernel("input_a", "input_b", "output_c")}} cute_a = 
from_dlpack(input_a, assumed_align=16) cute_b = from_dlpack(input_b, assumed_align=16) cute_c = from_dlpack(output_c, assumed_align=16) # Launch kernel {{kernel_name}}_jit(cute_a, cute_b, cute_c) return output_c """ # Combine all parts template = CuteDSLTemplate( name="fixed_add", grid=cutedsl_grid, source=template_part1 + template_part2 + template_part3 ) return template def fixed_cutedsl_lowering(a: TensorBox, b: TensorBox) -> TensorBox: """Fixed CuteDSL lowering.""" print(f"[FIXED] CuteDSL lowering: {a.get_size()} + {b.get_size()}") template = create_fixed_cutedsl_template() choices = [] error = template.maybe_append_choice( choices, input_nodes=[a.data, b.data], layout=a.get_layout() ) if error or not choices: print(f"[FIXED] Falling back: {error}") default_lowering = lowerings[torch.ops.aten.add.Tensor] return default_lowering(a, b) print(f"[FIXED] Using CuteDSL with {len(choices)} choices") result = autotune_select_algorithm( "fixed_cutedsl_add", choices, [a, b], a.get_layout(), ) return result def test_fixed_cutedsl(): """Test the fixed CuteDSL template.""" print("=" * 50) print("Fixed CuteDSL Template Test") print("=" * 50) original = lowerings.get(torch.ops.aten.add.Tensor, None) try: lowerings[torch.ops.aten.add.Tensor] = fixed_cutedsl_lowering def test_add(x, y): return x + y device = "cuda" if torch.cuda.is_available() else "cpu" x = torch.randn(128, 4, device=device, dtype=torch.float32) y = torch.randn(128, 4, device=device, dtype=torch.float32) print(f"[FIXED] Testing with {x.shape} tensors on {device}") compiled_fn = torch.compile(test_add, backend="inductor") result = compiled_fn(x, y) # Verify correctness expected = x + y if torch.allclose(result, expected, atol=1e-5): print("✅ [FIXED] Results match!") return True else: print("❌ [FIXED] Results don't match!") return False except Exception as e: print(f"❌ [FIXED] Failed: {e}") import traceback traceback.print_exc() return False finally: if original: lowerings[torch.ops.aten.add.Tensor] = original 
else: lowerings.pop(torch.ops.aten.add.Tensor, None) if __name__ == "__main__": success = test_fixed_cutedsl() print("🎉 Fixed test completed!" if success else "💥 Fixed test failed!") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160108 Approved by: https://github.com/mlazos	2025-08-18 04:37:15 +00:00
PyTorch UpdateBot	d18007a1d0	[vllm hash update] update the pinned vllm hash (#160847 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160847 Approved by: https://github.com/pytorchbot	2025-08-18 04:36:28 +00:00
dolpm	138413907a	[nativert] oss subgraph rewriter (#160780 ) Summary: att Test Plan: ci Rollback Plan: Differential Revision: D80367765 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160780 Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips	2025-08-18 04:25:05 +00:00
PyTorch MergeBot	3ced4f1e6c	Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836 )" This reverts commit 7a68d02292fd7a430b55c5bce3268a33c7ec5055. Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/clee2000 due to broke some inductor jobs? Maybe just update the expected values? Not sure what the policy is for something like this [GH job link](https://github.com/pytorch/pytorch/actions/runs/17024529273/job/48262123844) [HUD commit link](`7a68d02292`) ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3194953213))	2025-08-18 03:09:31 +00:00
Pian Pawakapan	075a2e6967	[PGO] add extra read/write keys (#160715 ) Differential Revision: D80321215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160715 Approved by: https://github.com/bobrenjc93	2025-08-18 01:41:08 +00:00
cyy	7a68d02292	Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836 ) Because numpy 1.22.4 had reached EOL 3 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836 Approved by: https://github.com/malfet	2025-08-17 18:39:06 +00:00
James Wu	63e1b58a13	[easy] [Precompile] Refactor guards, improve typing (#160530 ) Purely a refactor: improve typing and get rid of some type errors. Mark certain fields as non-null, since in general they are not empty. The goal of this stack of PRs is to move the save/load logic of guard serialization into separate, flat phases, instead of being embedded in guard creation. This way, we can put a try/catch around it and fail safely if certain guards are not serializable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160530 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007	2025-08-17 17:54:55 +00:00
cyy	960c03daf6	Remove unused CONDA_CMAKE option (#160832 ) Remove CONDA_CMAKE from `.ci/docker/build.sh` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160832 Approved by: https://github.com/malfet	2025-08-17 17:08:42 +00:00
PyTorch MergeBot	04c7be903d	Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 )" This reverts commit 8f434545c2e48c858d8b0d06db8f9642d6a87ad0. Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))	2025-08-17 14:22:48 +00:00
Johnny	691d17a5c6	Update TensorPipe submodule (#160808 ) To a commit containing https://github.com/pytorch/tensorpipe/pull/464 that fixes compilation with CUDA-13 Fixes https://github.com/pytorch/pytorch/issues/160104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160808 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007, https://github.com/malfet	2025-08-17 14:11:41 +00:00
Sandeep Narendranath Karjala	c699668009	[inductor] TLParse tensor metadata logging + test (#160132 ) Summary: - Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation. Testing: - Add test to verify structure and contents of tlparse artifact Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132 Approved by: https://github.com/xmfan	2025-08-17 04:27:49 +00:00
PyTorch UpdateBot	0b56f3aed8	[vllm hash update] update the pinned vllm hash (#160831 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160831 Approved by: https://github.com/pytorchbot	2025-08-17 04:25:26 +00:00
Nick Riasanovsky	8f434545c2	[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747 ) Summary: Inductor's 3.4 Triton release is the most common used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda` Rollback Plan: Differential Revision: D80348643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747 Approved by: https://github.com/NikhilAPatel	2025-08-17 00:35:12 +00:00
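The difference between gating on a release number and gating on the presence of an API, as described in #160747, can be sketched as follows. This is a hedged illustration, not Inductor's actual check: the two helper functions, the stand-in module objects, and the attribute name `make_tensor_descriptor` used as the probe are all assumptions made for the example.

```python
from types import SimpleNamespace


def has_tma_by_version(triton_like):
    """Version gate: only releases numbered 3.4 or later pass."""
    major, minor = (int(part) for part in triton_like.__version__.split(".")[:2])
    return (major, minor) >= (3, 4)


def has_tma_by_feature(triton_like):
    """Feature gate: any build that exposes the probed API passes."""
    # The attribute name is an assumption for this sketch.
    return hasattr(triton_like.language, "make_tensor_descriptor")


# Hypothetical stand-ins for an imported module; no real Triton is needed.
stock_build = SimpleNamespace(
    __version__="3.4.0",
    language=SimpleNamespace(make_tensor_descriptor=lambda *args: None),
)
fork_build = SimpleNamespace(
    __version__="3.3.1+fork",  # older number, but the API was backported
    language=SimpleNamespace(make_tensor_descriptor=lambda *args: None),
)

for build in (stock_build, fork_build):
    print(build.__version__, has_tma_by_version(build), has_tma_by_feature(build))
```

The version gate rejects the hypothetical fork even though it ships the API, which is the mismatch the commit message describes for alternative Triton variants.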
PyTorch MergeBot	26297c27e2	Revert "[inductor] TLParse tensor metadata logging + test (#160132 )" This reverts commit 2603e40be5fa4a66301e6654e34a82a67f2e4913. Reverted https://github.com/pytorch/pytorch/pull/160132 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/17010600949/job/48226137423) [HUD commit link](`2603e40be5`). landrace with another PR that changed some had_cuda related things ([comment](https://github.com/pytorch/pytorch/pull/160132#issuecomment-3193969792))	2025-08-16 23:47:03 +00:00
Guilherme Leobas	74871d4d46	[collections.abc] Ensure that binop calls works with UserDefinedObjects (#159865 ) Changes: (1) Replace UserDefinedSetVariable with UserDefinedObjectVariable in all binop calls Test plan: (1) The three tests from CPython `test_collections.py` ensure that Dynamo can trace through a dunder method (e.g. `__add__`, `__ixor__`, etc.) defined in a user-defined class Pull Request resolved: https://github.com/pytorch/pytorch/pull/159865 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864	2025-08-16 20:44:40 +00:00
Guilherme Leobas	f019da2979	Implement `list(UserDefinedObject)` via `force_unpack_var_sequence` (#159864 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159864 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366, #159368, #159483, #159902	2025-08-16 20:44:40 +00:00
Guilherme Leobas	f1bc843a5d	Wrap class definitions in `set_fullgraph(False)` in `test_collections` (#159902 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159902 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366, #159368, #159483	2025-08-16 20:42:15 +00:00
Sandeep Narendranath Karjala	2603e40be5	[inductor] TLParse tensor metadata logging + test (#160132 ) Summary: - Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation. Testing: - Add test to verify structure and contents of tlparse artifact Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132 Approved by: https://github.com/xmfan ghstack dependencies: #160260	2025-08-16 16:37:18 +00:00
Xuehai Pan	8fe4b3f848	[BE][CI] move `MYPYSTRICT` linter from `lintrunner-noclang` to `lintrunner-mypy` (#160806 ) Like `MYPY`, linter `MYPYSTRICT` will need `--all-files` too. See also: - https://github.com/pytorch/pytorch/pull/160652#issuecomment-3193390813 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160806 Approved by: https://github.com/seemethere	2025-08-16 16:15:22 +00:00
Hai Zheng	cff6def7f4	[MTIA] add correct name for CFF in tlparse (#160599 ) Differential Revision: D80201622 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160599 Approved by: https://github.com/bdhirsh	2025-08-16 14:58:03 +00:00
Laith Sakka	e444cd24d4	Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197 ) This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(), but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate. I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context: in https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function computing contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE. When people call is_contiguous we do sym_is_contiguous().guard_bool(); when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false(). One issue not handled well was this path: ``` c10::SymBool TensorImpl::sym_is_contiguous_custom( at::MemoryFormat memory_format) const { if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) { return pyobj_slot_.load_pyobj_interpreter()->is_contiguous( this, memory_format); } return sym_is_contiguous_default(memory_format); } ``` Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format). This went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used implicit size-oblivious reasoning. Once we removed that implicit size-oblivious reasoning, the right thing to do is to call return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format); otherwise we would get a DDE even if the caller is doing sym_is_contiguous. So I had to define it for pyinterpreter, and then I had to override it for nested tensors. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197 Approved by: https://github.com/ezyang	2025-08-16 09:15:58 +00:00
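The relationship between `sym_is_contiguous`, `guard_bool` and `guard_or_false` described in #159197 can be modelled with a small toy. This is a sketch under stated assumptions, not the `c10::SymBool` API: `ToySymBool`, `DataDependentError`, and the two free functions are hypothetical names, and `None` stands in for "contiguity cannot be decided at trace time".

```python
class DataDependentError(RuntimeError):
    """Toy stand-in for the error the commit message calls a DDE."""


class ToySymBool:
    """A boolean that may be unknown at trace time (value is None)."""

    def __init__(self, value=None):
        self.value = value

    def guard_bool(self):
        # Strict: an undecidable expression is an error.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity")
        return self.value

    def guard_or_false(self):
        # Lenient: an undecidable expression is treated as False.
        return False if self.value is None else self.value


def is_contiguous(sym):
    """Mirrors 'is_contiguous does sym_is_contiguous().guard_bool()'."""
    return sym.guard_bool()


def is_contiguous_or_false(sym):
    """Mirrors 'is_contiguous_or_false does sym_is_contiguous().guard_or_false()'."""
    return sym.guard_or_false()


known = ToySymBool(True)
unknown = ToySymBool(None)

print(is_contiguous(known), is_contiguous_or_false(known))
print(is_contiguous_or_false(unknown))
try:
    is_contiguous(unknown)
except DataDependentError as err:
    print("raised:", err)
```

In this model, building the symbolic value never raises; only the strict guard does, which is why a caller that asks for the symbolic form should not be routed through the strict path.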
Huy Do	a84541c73f	Update transformers version automatically with Dependabot (#160635 ) My proposal here is to use GitHub Dependabot to make sure that the `transformers` version used in CI is always up-to-date. To achieve this, this PR does 2 things: 1. Pin the `transformers` version across all CI jobs in only one place, `.ci/docker/ci_commit_pins/huggingface.txt`. This file is now a regular pip requirements file instead of a pinned commit text. There isn't any need to pin `transformers` to a specific commit, and the file already refers to a stable version, `v4.54.0`. 2. Create `.github/dependabot.yml` to configure the bot to update `transformers` automatically when there is a new version. Those labels will ensure that the right reviewers from torch.compile and Dev Infra are notified. I'm not sure how to test this out in a PR, but it feels ok to land and test this in main. If this works, we should see a PR to update `v4.54.0` to the current latest `v4.55.0` ### Reference https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference Pull Request resolved: https://github.com/pytorch/pytorch/pull/160635 Approved by: https://github.com/ZainRizvi	2025-08-16 05:53:39 +00:00
Rohit Singh Rathaur	114813ca77	Fix mypy errors: PyTreeSpec inheritance (#160652 ) Fixes #160650. I added type ignore comment to `LeafSpec` class inheritance in `torch/utils/_cxx_pytree.py` to handle `PyTreeSpec` being marked as final in optree's type stubs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160652 Approved by: https://github.com/Skylion007	2025-08-16 05:14:11 +00:00
Justin Chu	11b6ceb7b4	[ONNX] Default to dynamo export (#159646 ) Set dynamo=True and enable fallback. 1. Implement the compatible behavior where a BytesIO object is accepted as `f`. 2. Update tests to explicitly set dynamo=False. #151693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646 Approved by: https://github.com/titaiwangms	2025-08-16 04:48:58 +00:00
Michael Lazos	fb7e60ba7a	[Dynamo][Hierarchical Compile] Flatten tuple outputs in graph dedupe pass (#158811 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158811 Approved by: https://github.com/anijain2305 ghstack dependencies: #158810	2025-08-16 04:45:31 +00:00
PyTorch UpdateBot	f89186e910	[audio hash update] update the pinned audio hash (#160797 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160797 Approved by: https://github.com/pytorchbot	2025-08-16 04:26:59 +00:00
PyTorch UpdateBot	10eb83734f	[vllm hash update] update the pinned vllm hash (#160699 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160699 Approved by: https://github.com/pytorchbot	2025-08-16 04:26:55 +00:00
Yang Wang	75ea93484c	[vllm test] add vllm.yml and additional package (#160698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160698 Approved by: https://github.com/huydhn ghstack dependencies: #160116	2025-08-16 04:24:20 +00:00
Huy Do	45c2c7a5fc	Fix the wrong dataclasses_json mointoring dep MacOS test (#160796 ) Typo mistake. This should be `dataclasses_json` https://github.com/pytorch/pytorch/actions/runs/17000197828/job/48200676725#step:10:23 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160796 Approved by: https://github.com/yangw-dev	2025-08-16 04:00:31 +00:00
Shangdi Yu	b74c7cd335	Add kernel stack traces tlparse dump (#160608 ) (#160779 ) Summary: as title This is requested by the zoomer team so they can add stack trace information to profiler result. Test Plan: ``` buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r stack_traces ``` Rollback Plan: Differential Revision: D80050233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160779 Approved by: https://github.com/angelayi	2025-08-16 03:12:38 +00:00
Scott Todd	b7ca502f29	[ROCm][Windows] Add hipcc compatibility flags to cpp_extension.py. (#159790 ) This is a similar change to https://github.com/pytorch/pytorch/pull/153986, this time adding flags to the hipcc command under `cpp_extension.py`. The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these: ``` In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21: D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes] 18 | struct TORCH_API _sparse_softmax_int { | ^~~~~~~~~ D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API' 100 | #define TORCH_API C10_IMPORT | ^~~~~~~~~~ D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT' 53 | #define C10_IMPORT __declspec(dllimport) | ^~~~~~~~~ ``` The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html. See also this downstream issue where these changes were tested: https://github.com/ROCm/TheRock/issues/910. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159790 Approved by: https://github.com/jeffdaily	2025-08-16 02:20:49 +00:00
Nikita Shulga	7bd4cfaef4	[BE] Update nvshem dependency to 3.3.20 (#160458 ) Which is manylinux2_28 compatible, even on aarch64 platform archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works. Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel Should fix https://github.com/pytorch/pytorch/issues/160425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458 Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv	2025-08-16 02:00:57 +00:00
PyTorch MergeBot	c015e53d37	Revert "[BE] Update nvshem dependency to 3.3.20 (#160458 )" This reverts commit e0488d9f00865fb56c931580c80e099771c6285e. Reverted https://github.com/pytorch/pytorch/pull/160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](https://github.com/pytorch/pytorch/pull/160458#issuecomment-3193133706))	2025-08-16 01:47:42 +00:00
Laith Sakka	65dc4df74d	unify broadcast_shapes functions and avoid duplicates (#160251 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160251 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler ghstack dependencies: #160250	2025-08-16 00:54:32 +00:00
Laith Sakka	c03809e8a5	guard_or_false cat ops (#160250 ) keep existing unbacked semantics unchanged, just use guard_or_false instead of guard_size_obl Pull Request resolved: https://github.com/pytorch/pytorch/pull/160250 Approved by: https://github.com/ColinPeppler, https://github.com/jingsh	2025-08-16 00:54:31 +00:00
Nikita Shulga	e0488d9f00	[BE] Update nvshem dependency to 3.3.20 (#160458 ) Which is manylinux2_28 compatible, even on aarch64 platform archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works. Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel Should fix https://github.com/pytorch/pytorch/issues/160425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458 Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv	2025-08-16 00:50:13 +00:00
Laith Sakka	f782c790df	migrate more simple gso checks (#160253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160253 Approved by: https://github.com/bobrenjc93	2025-08-16 00:15:24 +00:00
atalman	16ce2c15fa	Add python 3.14 support to linux aarch64 builds (#160788 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160788 Approved by: https://github.com/malfet	2025-08-16 00:03:21 +00:00
Andrey Talman	0d28d12b11	Fix typo packing libnvshmem into libtorch (#160778 ) Fix typo after https://github.com/pytorch/pytorch/pull/160465 Fixes: https://github.com/pytorch/pytorch/issues/160762 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160778 Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/Skylion007	2025-08-15 23:43:02 +00:00
Edward Yang	838f22c57d	Do not incorrectly chain each of the strings as iterables (#160709 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160709 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2025-08-15 23:22:24 +00:00
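The commit above gives only a title, so the following is a minimal sketch of the general pitfall it names, not the code that was actually changed. It assumes the bug resembled passing strings to `itertools.chain`, which treats each string as an iterable of characters. The list `groups` is hypothetical.

```python
from itertools import chain

groups = ["alpha", "beta"]  # hypothetical list of string items

# Pitfall: each string is itself an iterable, so unpacking the list into
# chain() yields individual characters rather than whole strings.
wrong = list(chain(*groups))

# Intended behavior: chain the containers, so each string stays one item.
right = list(chain(groups, ["gamma"]))

print(wrong)
print(right)
```

Wrapping a lone string in a list or tuple before chaining is the usual way to keep it as a single element.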
eqy	387fe847ab	[cuDNN][SDPA] Introduce `TORCH_CUDNN_SDPA_AVOID_RECOMPILE=1` (#155958 ) Opt-in for now, but basically uses the variable-sequence length/ragged path for the common case of BSHD layout to avoid recompiling for different sequence lengths. Built on top of #149282 Tested using a primitive fuzzer, seems at least as stable as default path (with recompilation) on B200 (50000+ cases tested without any failures) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155958 Approved by: https://github.com/drisspg	2025-08-15 21:59:18 +00:00
Mu-Chu Lee	40311e2ec1	[AOTInductor] ABI-Compatibility for RecordFunction. (#159842 ) Summary: Previously, our implementation of RecordFunction injected ATen into codegen, which breaks the ABI contract for AOTInductor. C10::IValue is added to call the full record function. Extensions for more profiling info will come in later PRs. Test Plan: Included in commit. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842 Approved by: https://github.com/desertfire	2025-08-15 21:45:47 +00:00
Yidi Wu	8ca8b6053c	[inductor][while_loop][be] improve the readability of output handling (#160374 ) The logic doesn't change but make it easier to read and change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160374 Approved by: https://github.com/zou3519 ghstack dependencies: #160548	2025-08-15 20:13:12 +00:00
Yidi Wu	ff86509a06	[map] filter none gradients and add autograd inductor tests (#160548 ) Will filter the none outputs in autograd backward for other hops as follow ups Pull Request resolved: https://github.com/pytorch/pytorch/pull/160548 Approved by: https://github.com/zou3519	2025-08-15 20:13:12 +00:00
Shangdi Yu	fa75ba9303	Change IR node's stack traces to return a set of stack traces only (#160701 ) Summary: There can be excessive stack trace outputs in TORCH_LOGS="+inductor" when a single line of code corresponds to many post grad nodes, e.g. `self.multihead_attn(x, x, x)`, in that case, we'll see the same stack trace many times in the IR node, spamming the output log. So we change to return a set of stack traces. Test Plan: CI Rollback Plan: Differential Revision: D80310549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160701 Approved by: https://github.com/angelayi	2025-08-15 19:31:59 +00:00
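The deduplication described in the commit above can be illustrated with a small sketch. This is not the Inductor implementation: the function name `unique_stack_traces` and the sample trace strings are hypothetical, and only the idea of collapsing repeated traces into a set comes from the commit message.

```python
def unique_stack_traces(traces):
    """Collapse repeated stack traces into a set of distinct traces."""
    return set(traces)


# One source line can map to many post-grad nodes, each carrying the same trace.
node_traces = [
    'File "model.py", line 10, in forward\n    self.multihead_attn(x, x, x)',
    'File "model.py", line 10, in forward\n    self.multihead_attn(x, x, x)',
    'File "model.py", line 11, in forward\n    return self.proj(out)',
]

distinct = unique_stack_traces(node_traces)
print(len(node_traces), "->", len(distinct))
```

A set drops ordering; if log output needs to stay in first-seen order, `dict.fromkeys(traces)` is a common alternative.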
Guilherme Leobas	b78968b4d1	Support `next(iterator, default)` (#159483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159483 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366, #159368	2025-08-15 19:08:21 +00:00
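For reference, the two-argument form of the builtin `next` named in the commit above returns the default instead of raising `StopIteration` once the iterator is exhausted. The sketch below shows only the plain Python semantics; how Dynamo traces it is not described in the commit message and is not shown here.

```python
it = iter([1, 2])

# The third call would raise StopIteration without the default argument.
values = [next(it, None), next(it, None), next(it, None)]

print(values)
```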
Guilherme Leobas	e5621b4d8b	Fixes for `collections.Counter` (#159368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159368 Approved by: https://github.com/mlazos ghstack dependencies: #159365, #159366	2025-08-15 19:08:21 +00:00
Guilherme Leobas	2542e71f3f	Change mutation type of `MutableMappingVariable` to `AttributeMutationNew` (#159366 ) Also add MutableMappingVariable to `call_or_` / `call_ior` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159366 Approved by: https://github.com/zou3519 ghstack dependencies: #159365	2025-08-15 19:08:21 +00:00
Guilherme Leobas	0242d40fa5	Enable trace through the collections module (#159365 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159365 Approved by: https://github.com/zou3519	2025-08-15 19:08:21 +00:00
atalman	17de899709	Add py3.14 to macos arm64 (#160593 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160593 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-08-15 18:52:10 +00:00
Shangdi Yu	25d0d8b0a3	[inductor] Fix propagating torch.utils._sympy.functions.Identity in IndexPropagation (#155504 ) Fixes https://github.com/pytorch/pytorch/issues/160535 Index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr with Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap Identity. Test Plan: buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing Rollback Plan: Differential Revision: D76308640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504 Approved by: https://github.com/eellison	2025-08-15 18:38:23 +00:00
Liao, Wei	c6d697ff52	port 2 distributed pipeline test files for Intel GPU (#159140 ) This is another PR porting distributed pipeline tests to Intel GPU; the other PR is https://github.com/pytorch/pytorch/pull/159033. In this PR, we port two test files for Intel GPU. We enable Intel GPU with the following methods while keeping the original code style as much as possible: 1. instantiate_device_type_tests() 2. skip the case on XPU due to an accuracy gap introduced by oneDNN non-determinism Pull Request resolved: https://github.com/pytorch/pytorch/pull/159140 Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/H-Huang	2025-08-15 18:29:50 +00:00
PyTorch MergeBot	30d2f98daa	Revert "[cutlass backend] re-add pip cutlass path (#160180 )" This reverts commit d556586448f3caab85673c7da0978fe31c7748f7. Reverted https://github.com/pytorch/pytorch/pull/160180 on behalf of https://github.com/atalman due to broke macos nightly ([comment](https://github.com/pytorch/pytorch/pull/160180#issuecomment-3192311552))	2025-08-15 18:00:41 +00:00
Xuan Zhang	8780d28c65	raise exception in case of errors in memory reordering (#160455 ) This PR introduces two checks in the memory reordering pass to catch graph issues before performing the reordering task. For situations not covered by these checks, the reordering pass might fail, and an exception will be thrown in that case. This addresses issue https://github.com/pytorch/pytorch/issues/159568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160455 Approved by: https://github.com/eellison	2025-08-15 17:31:55 +00:00
Yidi Wu	da8f48d88f	[associative_scan] support gen_schema for associative_scan (#158883 ) In-place mutation may create inter-loop dependency that breaks the parallelism we have for associative_scan so we ban input mutations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158883 Approved by: https://github.com/zou3519 ghstack dependencies: #154193, #158965, #158863, #158864	2025-08-15 17:28:44 +00:00
Yidi Wu	cb9e2092a8	[scan] support gen_schema for scan (#158864 ) We don't want to allow scan's combine_fn to mutate its inputs. The semantics of the mutation can be confusing. For example: ```python def combine_fn(init, x): ``` If combine_fn mutates init, only the first iteration mutates init; the remaining iterations mutate the previous carry, which is an intermediate result. These semantics are odd because the only observable mutation is to init, which can be done outside of combine_fn. If combine_fn mutates x, where x is a slice of the scanned inputs (i.e. xs), the pattern is more meaningful, but we have not seen any use case yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158864 Approved by: https://github.com/zou3519 ghstack dependencies: #154193, #158965, #158863	2025-08-15 17:28:44 +00:00
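The mutation semantics described in the commit above can be reproduced with a pure-Python sketch. This is not the PyTorch `scan` higher-order op: the `scan` helper and `mutating_combine_fn` below are hypothetical stand-ins that use Python lists as the carry, assuming only the carry-passing structure the commit message describes.

```python
def scan(combine_fn, init, xs):
    """Toy scan: thread a carry through combine_fn and collect outputs."""
    carry = init
    ys = []
    for x in xs:
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, ys


def mutating_combine_fn(carry, x):
    carry.append(x)          # in-place mutation of the first argument
    new_carry = list(carry)  # the returned carry is a fresh intermediate
    return new_carry, sum(new_carry)


init = []
final, ys = scan(mutating_combine_fn, init, [1, 2, 3])

# Only the first iteration touched `init`; later iterations mutated
# intermediate carries that the caller never sees.
print(init, final, ys)
```

After the call, `init` holds only the first element while `final` holds all three, which is the confusing behavior the commit cites as the reason to ban input mutation.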
Yidi Wu	f6bf1573fc	[while_loop] support gen_schema for while_loop (#158863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158863 Approved by: https://github.com/zou3519 ghstack dependencies: #154193, #158965	2025-08-15 17:28:34 +00:00
Yidi Wu	82a18423be	[BE] create an empty shape_env for check_input_alias_and_mutation_return_outputs (#158965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158965 Approved by: https://github.com/zou3519 ghstack dependencies: #154193	2025-08-15 17:28:20 +00:00
Yidi Wu	3fe3c23d4e	[cond] support gen_schema for cond (#154193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154193 Approved by: https://github.com/zou3519	2025-08-15 17:28:13 +00:00
Prajesh Praveen Anchalia	052c441cf4	Add logging for when inbuilt_inline_nn_modules will help with ID_MATCH guard triggered recompiles (#160592 ) We add logging for when an ID_MATCH guard is added at a place where inbuilt_inline_nn_modules would inline it. The aim is to tag recompiles that could be avoided by setting the inbuilt_inline_nn_modules flag. It will help us log and track the flag's adoption and potentially quantify the savings in the number of recompiles. Differential Revision: D80075975 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160592 Approved by: https://github.com/anijain2305	2025-08-15 17:09:39 +00:00
Paul de Supinski	b26d2a9464	[ez] Make NUMA signpost parameters JSON serializable (#160710 ) # Context Broader context in #160163. In order for the _utils_internal version of signpost_event to do proper logging, its parameters argument needs to be json serializable. # This PR Convert `NumaOptions` to serializable form before inputting to `signpost_event`. # Test Plan ## Automated Added tests `$ pytest test/test_numa_binding.py`. ## Manual See [D80317206](https://www.internalfb.com/diff/D80317206). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160710 Approved by: https://github.com/kiukchung	2025-08-15 16:52:43 +00:00
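The conversion described in the commit above, turning an options object into a JSON-serializable form before logging, might look like the sketch below. The class `NumaOptionsSketch`, its fields, the `AffinityMode` enum, and the helper `to_json_safe` are all hypothetical; the real `NumaOptions` definition and the `signpost_event` signature are not shown in the commit message.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class AffinityMode(Enum):  # hypothetical enum field type
    NODE = "node"
    SOCKET = "socket"


@dataclass(frozen=True)
class NumaOptionsSketch:  # hypothetical stand-in for NumaOptions
    affinity_mode: AffinityMode
    should_fall_back: bool = False


def to_json_safe(options):
    """Convert a dataclass to a dict, replacing enum members by their values."""
    return {
        key: (value.value if isinstance(value, Enum) else value)
        for key, value in asdict(options).items()
    }


payload = json.dumps(to_json_safe(NumaOptionsSketch(AffinityMode.NODE)))
print(payload)
```

Passing the dataclass straight to `json.dumps` would raise `TypeError`, which is why the conversion happens before the logging call.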
Kurt Mohler	6382302990	[MPS] Add `grid_sampler_3d` for MPS (#160541 ) This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation. NOTE: "nearest" interpolation is not yet supported Fixes #159882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541 Approved by: https://github.com/malfet	2025-08-15 16:19:25 +00:00
Catherine Lee	80dd05e31e	Disable flaky cpp test RecordDebugHandles.Basic (#160577 ) Test is flaky and sometimes hangs in CI Here's an example of the failure: https://github.com/pytorch/pytorch/actions/runs/16946153494/job/48027937663 ``` 2025-08-13T20:54:00.1223688Z ==================================== RERUNS ==================================== 2025-08-13T20:54:00.1224156Z ___________________________ RecordDebugHandles.Basic ___________________________ 2025-08-13T20:54:00.1224682Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13 2025-08-13T20:54:00.1225568Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6): 2025-08-13T20:54:00.1226430Z CUDA not available. Disabling CUDA and MultiCUDA tests 2025-08-13T20:54:00.1226988Z Note: Google Test filter = RecordDebugHandles.Basic-_CUDA:_MultiCUDA 2025-08-13T20:54:00.1227450Z [==========] Running 1 test from 1 test suite. 2025-08-13T20:54:00.1227792Z [----------] Global test environment set-up. 2025-08-13T20:54:00.1228145Z [----------] 1 test from RecordDebugHandles 2025-08-13T20:54:00.1228492Z [ RUN ] RecordDebugHandles.Basic 2025-08-13T20:54:00.1228822Z [ OK ] RecordDebugHandles.Basic (1 ms) 2025-08-13T20:54:00.1229204Z [----------] 1 test from RecordDebugHandles (1 ms total) 2025-08-13T20:54:00.1229501Z 2025-08-13T20:54:00.1229666Z [----------] Global test environment tear-down 2025-08-13T20:54:00.1230033Z [==========] 1 test from 1 test suite ran. (1 ms total) 2025-08-13T20:54:00.1230355Z [ PASSED ] 1 test. 2025-08-13T20:54:00.1230727Z terminate called after throwing an instance of 'std::system_error' 2025-08-13T20:54:00.1231154Z what(): Invalid argument 2025-08-13T20:54:00.1231416Z unknown file:0: C++ failure 2025-08-13T20:54:00.1231788Z ------------------------------ Captured c++ call ------------------------------- 2025-08-13T20:54:00.1232262Z CUDA not available. 
Disabling CUDA and MultiCUDA tests 2025-08-13T20:54:00.1232745Z Note: Google Test filter = RecordDebugHandles.Basic-_CUDA:_MultiCUDA 2025-08-13T20:54:00.1233199Z [==========] Running 1 test from 1 test suite. 2025-08-13T20:54:00.1233557Z [----------] Global test environment set-up. 2025-08-13T20:54:00.1233915Z [----------] 1 test from RecordDebugHandles 2025-08-13T20:54:00.1234247Z [ RUN ] RecordDebugHandles.Basic 2025-08-13T20:54:00.1234590Z [ OK ] RecordDebugHandles.Basic (1 ms) 2025-08-13T20:54:00.1235020Z [----------] 1 test from RecordDebugHandles (1 ms total) 2025-08-13T20:54:00.1235304Z 2025-08-13T20:54:00.1235431Z [----------] Global test environment tear-down 2025-08-13T20:54:00.1235793Z [==========] 1 test from 1 test suite ran. (1 ms total) 2025-08-13T20:54:00.1236126Z [ PASSED ] 1 test. 2025-08-13T20:54:00.1236481Z terminate called after throwing an instance of 'std::system_error' 2025-08-13T20:54:00.1236906Z what(): Invalid argument 2025-08-13T20:54:00.1237287Z ___________________________ RecordDebugHandles.Basic ___________________________ 2025-08-13T20:54:00.1237800Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13 2025-08-13T20:54:00.1238686Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6): 2025-08-13T20:54:00.1239551Z CUDA not available. Disabling CUDA and MultiCUDA tests 2025-08-13T20:54:00.1240048Z Note: Google Test filter = RecordDebugHandles.Basic-_CUDA:_MultiCUDA 2025-08-13T20:54:00.1240495Z [==========] Running 1 test from 1 test suite. 2025-08-13T20:54:00.1240848Z [----------] Global test environment set-up. 
2025-08-13T20:54:00.1241199Z [----------] 1 test from RecordDebugHandles 2025-08-13T20:54:00.1241542Z [ RUN ] RecordDebugHandles.Basic 2025-08-13T20:54:00.1241871Z [ OK ] RecordDebugHandles.Basic (1 ms) 2025-08-13T20:54:00.1242249Z [----------] 1 test from RecordDebugHandles (1 ms total) 2025-08-13T20:54:00.1242503Z 2025-08-13T20:54:00.1242641Z [----------] Global test environment tear-down 2025-08-13T20:54:00.1242993Z [==========] 1 test from 1 test suite ran. (19 ms total) 2025-08-13T20:54:00.1243329Z [ PASSED ] 1 test. 2025-08-13T20:54:00.1243697Z terminate called after throwing an instance of 'std::system_error' 2025-08-13T20:54:00.1244113Z what(): Invalid argument 2025-08-13T20:54:00.1244392Z unknown file:0: C++ failure 2025-08-13T20:54:00.1244759Z ------------------------------ Captured c++ call ------------------------------- 2025-08-13T20:54:00.1245235Z CUDA not available. Disabling CUDA and MultiCUDA tests 2025-08-13T20:54:00.1283768Z ============== 1 failed, 568 passed, 2 rerun in 115.57s (0:01:55) ============== ``` Here's an example of the hang: https://github.com/pytorch/pytorch/actions/runs/16942186826/job/48015238944 Logs aren't super helpful other than stating that it took a long time. 
Usually this file takes <2min to run ``` 2025-08-13T18:43:24.6586481Z [gw0] [ 97%] PASSED [1.4119s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/8 2025-08-13T18:43:24.6587278Z [gw1] [ 97%] PASSED [1.4866s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/9 Command took >30min, returning 124 2025-08-13T18:43:24.6587288Z 2025-08-13T18:43:24.6587632Z FINISHED PRINTING LOG FILE of cpp/test_jit 1/1 (test/test-reports/cpp.test_jit_1.1_c259e5a152845991_.log) 2025-08-13T18:43:24.6587639Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160577 Approved by: https://github.com/huydhn	2025-08-15 15:59:21 +00:00
PyTorch MergeBot	9df07ecfbe	Revert "[inductor] dont reuse buffers if it affects peak (#145883 ) (#159530 )" This reverts commit 3be70dc30e893b552fc0f23ca06cd8f7949b6d08. Reverted https://github.com/pytorch/pytorch/pull/159530 on behalf of https://github.com/clee2000 due to newly added test fail internally D80316528, probably just a targets change, but also imo the tests should probably go into a testcase class from common or inductor utils. While I'm pretty sure CI can run the globally defined ones, theres some CI related functionality that on the testcase class that CI benefits from ([comment](https://github.com/pytorch/pytorch/pull/159530#issuecomment-3191947506))	2025-08-15 15:49:04 +00:00
PyTorch MergeBot	846963fa9b	Revert "[Inductor] addmm + activation function fusion (#158137 )" This reverts commit b9d7de3a094598c3dc0dd52e57bce30eb684c9d8. Reverted https://github.com/pytorch/pytorch/pull/158137 on behalf of https://github.com/malfet due to Broke inductor torchbench, see `663da17b62/1` ([comment](https://github.com/pytorch/pytorch/pull/158137#issuecomment-3191841298))	2025-08-15 15:34:09 +00:00
chunhuanMeng	663da17b62	Update torch-xpu-ops commit pin (#160062 ) Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](`77cc792cd2`), includes: - Ensures that the XPU cache is cleared before creating tensors during the test - Add unused variable warning - Fix test_linalg and test_torch issue with bf32_on_and_off updates - Fix deterministic indexing with broadcast - Fix dist.gather with noncontiguous tensor - Improve accuracy of index put deterministic kernel - Add generate file rely avoid build before generate - optimize embedding bag Fixes #160661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062 Approved by: https://github.com/EikanWang	2025-08-15 15:27:24 +00:00
Shiva Kaul	e299926f72	[ONNX] Fix doc typo for symbolic_multi_out (#160702 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160702 Approved by: https://github.com/justinchuby	2025-08-15 14:34:42 +00:00
Huy Do	bbd11c4f23	Uninstall torchao on MPS benchmark (#160724 ) Fixes https://github.com/pytorch/pytorch/issues/160689 The current torchao 0.12.0 doesn't work with transformers 4.54.0 and ends up with this error: ``` File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/models/albert/modeling_albert.py", line 37, in <module> from ...modeling_utils import PreTrainedModel File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/modeling_utils.py", line 51, in <module> from torchao.quantization import Int4WeightOnlyConfig File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/__init__.py", line 41, in <module> from torchao.quantization import ( File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/__init__.py", line 6, in <module> from .autoquant import ( File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/autoquant.py", line 11, in <module> from torchao.dtypes import ( File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/__init__.py", line 1, in <module> from . 
import affine_quantized_tensor_ops File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/affine_quantized_tensor_ops.py", line 38, in <module> from torchao.dtypes.uintx.dyn_int8_act_int4_wei_cpu_layout import ( File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/__init__.py", line 7, in <module> from .dyn_int8_act_int4_wei_cpu_layout import ( File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/dyn_int8_act_int4_wei_cpu_layout.py", line 320, in <module> from ...prototype.inductor.fx_passes import register_da8w4_concat_linear_cpu_pass File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/__init__.py", line 2, in <module> from .int8_sdpa_fusion import _int8_sdpa_init File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/int8_sdpa_fusion.py", line 22, in <module> from ..int8_sdpa_lowering import register_int8_sdpa # noqa: F401 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/int8_sdpa_lowering.py", line 6, in <module> from torch._inductor.kernel.flex_attention import construct_strides, maybe_realize ModuleNotFoundError: No module named 'torch._inductor.kernel.flex_attention' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160724 Approved by: https://github.com/malfet	2025-08-15 13:55:39 +00:00
Sherlock Huang	eaa5d9d3d3	Introduce OpInfo test for testing export on fake device (#160694 ) Summary: Prepare for the upcoming diffs for exporting on fake cuda device. Test Plan: test Rollback Plan: Differential Revision: D80304225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160694 Approved by: https://github.com/dolpm	2025-08-15 07:26:28 +00:00
Colin Peppler	a7c75ae976	[dde] use sym_or when checking normalized shape in layer_norm (#160683 ) Use `sym_eq` to check equality on tuple of ints/symints ### DDE ``` torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, u1) (unhinted: Eq(u0, u1)). (Size-like symbols: u1, u0) Caused by: return torch.nn.functional.layer_norm( # test/inductor/test_unbacked_symints.py:527 in fn (_refs/__init__.py:3292 in native_layer_norm) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160683 Approved by: https://github.com/bobrenjc93	2025-08-15 06:56:00 +00:00
Pian Pawakapan	f7ad69f59c	[dynamic shapes] handle Max(*,1) for inductor layout contiguity (#160578 ) Differential Revision: D80214882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160578 Approved by: https://github.com/ZixinYang, https://github.com/bobrenjc93	2025-08-15 06:10:18 +00:00
Wang, Chuanqi	4cae9cf2df	Update triton xpu commit to support python 3.14 (#160183 ) Follow PR #159725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-15 05:41:17 +00:00
Yang Wang	7710800865	[3/3][ghstack][vllm ci build setup]vllm build workflow (#160116 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160116 Approved by: https://github.com/huydhn	2025-08-15 05:35:46 +00:00
Shangdi Yu	aa99e0958f	Separate provenance tracking to different levels (#160383 ) Summary: as title. We've received requests from various parties who are interested in turning on provenance tracking by default. In this PR, we prepare to turn on by default the part of provenance tracking that doesn't add much overhead. - Change `provenance_tracking` config to `provenance_tracking_level` - turn on the following provenance tracking by default when `basic_provenance_tracking`=True - `set_kernel_post_grad_provenance_tracing` for kernels, this adds a mapping between triton kernels and post_grad nodes - `dump_inductor_provenance_info` if we're dumping tlparse log - `get_graph_provenance_json` and dump `create_mapping_pre_post_grad_nodes`. This creates a mapping between pre_grad and post_grad nodes. Since we're not turning on the provenance tracking in GraphTransformObserver by default, the mapping here may be incomplete/limited. - add stack trace from post grad nodes to inductor IR nodes - add exception swallowing for all functions above Test Plan: CI Rollback Plan: Differential Revision: D80031559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160383 Approved by: https://github.com/angelayi	2025-08-15 04:59:35 +00:00
PyTorch UpdateBot	3fc7a95176	[audio hash update] update the pinned audio hash (#160485 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160485 Approved by: https://github.com/pytorchbot	2025-08-15 04:27:49 +00:00
Kevin Fu	858fb80b9b	[PT2]: Add Static Dispatch Kernel for wrapped_fbgemm_linear_fp16_weight (#160451 ) Summary: Add static dispatch kernel for wrapped_fbgemm_linear_fp16_weight. This optimization should improve perf for all Ads DSNN models using Sigmoid. Test Plan: ``` MODEL_TYPE=dpa_product_first_ctr_model MODEL_ENTITY_ID=892669089 SNAPSHOT_ID=37 OTHER_MODEL_ENTITY_ID=892669089 OTHER_SNAPSHOT_ID=36 MODULES=(mix prepare_float_features object user) SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only) for i in "${!MODULES[@]}"; do MODULE=${MODULES[i]} SUFFIX=${SUFFIXES[i]} buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true ``` Before: P1900475429 I0810 19:29:22.782902 2717337 load_net_predictor_lib.cpp:1807] Average latency A: 0.0843 ms I0810 19:29:22.782905 2717337 load_net_predictor_lib.cpp:1807] Average latency B: 0.0989 ms After: P1900825771 I0811 15:42:34.866408 2311279 load_net_predictor_lib.cpp:1807] Average latency A: 0.0854 ms I0811 15:42:34.866411 2311279 load_net_predictor_lib.cpp:1807] Average latency B: 0.092 ms Still has some regression but the gap is smaller... Rollback Plan: Reviewed By: henryoier, muchulee8 Differential Revision: D80042054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160451 Approved by: https://github.com/henryoier	2025-08-15 04:06:17 +00:00
Kevin Fu	55061c9602	[PT2]: Add Static Dispatch Kernel for scale_gradient (#160454 ) Summary: Add Static Dispatch Kernel for scale_gradient Test Plan: ``` MODEL_TYPE=dpa_product_first_ctr_model MODEL_ENTITY_ID=892669089 SNAPSHOT_ID=37 OTHER_MODEL_ENTITY_ID=892669089 OTHER_SNAPSHOT_ID=36 MODULES=(mix prepare_float_features object user) SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only) for i in "${!MODULES[@]}"; do MODULE=${MODULES[i]} SUFFIX=${SUFFIXES[i]} buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true ``` Rollback Plan: Reviewed By: henryoier Differential Revision: D80062244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160454 Approved by: https://github.com/henryoier	2025-08-15 03:42:39 +00:00
Kevin Fu	214d04833a	[PT2]: Add Static Dispatch Kernel for fmod.Scalar (#160654 ) Summary: Add static dispatch for torch.ops.aten.fmod.Scalar. Found this missing in user/object nets for DSNN models. Test Plan: ``` MODEL_TYPE=dpa_product_first_ctr_model MODEL_ENTITY_ID=892669089 SNAPSHOT_ID=36 MODULE=user SUFFIX=.predictor.precompute.remote_request_only buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkEnableProfiling=true --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=1000 ``` Object tower: P1904347784 User tower: P1904348406 Rollback Plan: Differential Revision: D80238495 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160654 Approved by: https://github.com/henryoier	2025-08-15 03:11:48 +00:00
Johnny	9c5601ecc3	[NVIDIA] Refactor Family Blackwell Support codegen (#156176 ) With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1. This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced. Thor 10.1 --> 11.0 Spark 12.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156176 Approved by: https://github.com/ezyang	2025-08-15 02:51:26 +00:00
Nikita Shulga	5b9ad951f8	[BE][Docker] Do not install `cuda:11.8` (#160695 ) As CUDA-11.8 binaries are no longer produced by CD Pull Request resolved: https://github.com/pytorch/pytorch/pull/160695 Approved by: https://github.com/huydhn	2025-08-15 02:23:04 +00:00
Lucas Kabela	4d5f92aa39	typing tvm.py (#160369 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160369 Approved by: https://github.com/Skylion007 ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367, #160368	2025-08-15 02:09:31 +00:00
Lucas Kabela	39ca0ce0c8	Type backend torchxla (#160368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160368 Approved by: https://github.com/Skylion007 ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367	2025-08-15 02:09:31 +00:00
Lucas Kabela	d52bb67ac3	typing registry.py (#160367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160367 Approved by: https://github.com/Skylion007 ghstack dependencies: #160362, #160363, #160364, #160365, #160366	2025-08-15 02:09:31 +00:00
Lucas Kabela	05b9b63fb6	typing inductor and placeholder backends (#160366 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160366 Approved by: https://github.com/Skylion007 ghstack dependencies: #160362, #160363, #160364, #160365	2025-08-15 02:09:31 +00:00
Lucas Kabela	453cfa5153	typing distributed.py (#160365 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160365 Approved by: https://github.com/StrongerXi ghstack dependencies: #160362, #160363, #160364	2025-08-15 02:09:31 +00:00
Lucas Kabela	9faca5f260	typing debugging.py (#160364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160364 Approved by: https://github.com/Skylion007 ghstack dependencies: #160362, #160363	2025-08-15 02:09:31 +00:00
Lucas Kabela	6fe6dd9fdc	Type cudagraphs.py (#160363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160363 Approved by: https://github.com/StrongerXi ghstack dependencies: #160362	2025-08-15 02:09:31 +00:00
Lucas Kabela	f82c7eed84	Typing for common.py (#160362 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160362 Approved by: https://github.com/Skylion007	2025-08-15 02:09:31 +00:00
Nick Riasanovsky	25ccc4716e	[Inductor] [Triton] Apply feedback to Enable padded stride support (#160614 ) Summary: Issue I noticed while fixing tests for TMA store. This triton.language.make_tensor_descriptor call hardcodes the shape information as the stride, which is not necessarily correct. In particular, it's legal to have a stride bigger than the shape (e.g. padded to a size). A good example of the usage of this would be to allocate a tensor to always be a multiple of 16 and just pad the result so TMA is legal. This is a redo of https://github.com/pytorch/pytorch/pull/160493 because I broke this accidentally trying to land internally first instead of merging through GitHub directly. Test Plan: Tested with `buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.nvcc_arch=h100 caffe2/test/inductor:max_autotune 2>&1 | tee ~/test_logs.log` and confirmed all max autotune tests passed. Rollback Plan: Differential Revision: D80224578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160614 Approved by: https://github.com/eellison	2025-08-15 02:06:14 +00:00
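The shape/stride distinction in #160614 can be illustrated without Triton. This is a minimal sketch, assuming a row-major 2D buffer; the helper `padded_row_major_strides` is hypothetical and is not a PyTorch or Triton API. It shows that padding each row to a multiple of 16 elements yields a row stride larger than the logical width, so a descriptor cannot derive strides from the shape alone.

```python
def padded_row_major_strides(shape, multiple=16):
    """Return (row_stride, col_stride) in elements for a 2D row-major buffer
    whose rows are padded up to the next multiple of `multiple` elements.

    Hypothetical helper for illustration only.
    """
    rows, cols = shape
    row_stride = -(-cols // multiple) * multiple  # ceiling to a multiple
    return (row_stride, 1)


shape = (4, 20)
strides = padded_row_major_strides(shape)
# Logical width is 20, but consecutive rows start 32 elements apart.
print(shape, strides)
```

A descriptor built for such a buffer would take `shape` and `strides` as separate inputs rather than assuming the stride equals the width.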
Guilherme Leobas	d387a48c38	[generator] Raise `StopIteration(value)` with value from the return stmt (#157152 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157152 Approved by: https://github.com/zou3519 ghstack dependencies: #157148	2025-08-15 01:42:40 +00:00
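For reference, the CPython behaviour that #157152 mirrors can be shown in plain Python. This is a sketch of the language semantics only, not of the Dynamo implementation: a `return value` inside a generator surfaces as `StopIteration(value)`, and `yield from` evaluates to that value.

```python
def inner():
    yield 1
    return "done"  # becomes StopIteration("done") when the generator finishes


def outer():
    # `yield from` evaluates to the value carried by StopIteration.
    result = yield from inner()
    yield result


g = inner()
first = next(g)
try:
    next(g)
    returned = None
except StopIteration as exc:
    returned = exc.value

collected = list(outer())
print(first, returned, collected)
```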
Guilherme Leobas	831e85104a	[contextlib] Fixes for CPython contextlib tests (#157148 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157148 Approved by: https://github.com/zou3519	2025-08-15 01:42:40 +00:00
David Berard	211c98859a	[inductor][triton] Update triton_builtin handling after triton # 7239 (#160658 ) https://github.com/triton-lang/triton/pull/7239 will search for a _semantic kwarg in the signature of the function before passing in this kwarg. To fix this in Inductor: 1. explicitly take a _semantic kwarg 2. remove the functools.wraps around the wrapper function, which was causing inspect.signature to return the signature of the wrapped function (instead of the signature of the wrapper, which does contain the _semantic arg) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160658 Approved by: https://github.com/PaulZhang12, https://github.com/njriasan	2025-08-15 00:39:24 +00:00
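The `inspect.signature` effect described in #160658 can be reproduced with the standard library alone. This is a minimal sketch; `builtin` and `make_wrapper` are illustrative names, not Inductor or Triton functions. By default `inspect.signature` follows the `__wrapped__` attribute that `functools.wraps` sets, so the `_semantic` keyword on the wrapper becomes invisible.

```python
import functools
import inspect


def builtin(x, y):
    return x + y


def make_wrapper(fn, use_wraps):
    # The wrapper explicitly accepts a `_semantic` keyword argument.
    def wrapper(*args, _semantic=None, **kwargs):
        return fn(*args, **kwargs)

    if use_wraps:
        # Sets wrapper.__wrapped__ = fn, which inspect.signature follows.
        wrapper = functools.wraps(fn)(wrapper)
    return wrapper


params_with_wraps = list(inspect.signature(make_wrapper(builtin, True)).parameters)
params_without = list(inspect.signature(make_wrapper(builtin, False)).parameters)
print(params_with_wraps)  # signature of the wrapped function
print(params_without)     # signature of the wrapper itself
```

A caller that searches the signature for `_semantic` only finds it in the second case, which matches the reason given above for dropping `functools.wraps`.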
Kaichao You	dae7710bf2	[cuda][cupy] Improve cupy device placement when device is provided with explicit index (#158529 ) resubmit https://github.com/pytorch/pytorch/pull/158320 , fixing a potential bug when device index is not specified explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158529 Approved by: https://github.com/ezyang	2025-08-15 00:27:42 +00:00
ankushwahaRH	dc194a3096	Test multiprocessing spawn timing fix (#160672 ) Submitting PR to fix #160511. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160672 Approved by: https://github.com/mikaylagawarecki	2025-08-15 00:11:55 +00:00
Jeff Daily	4051b42c29	[ROCm] hipify needs specific header mappings (#160675 ) Fixes #160579. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160675 Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-15 00:09:04 +00:00
henrylhtsang	eb0eaa67e1	[BE][ci] Increase frequency of cutlass backend ci (#160656 ) * increase frequency from every 24 hours to every 12 hours * automatically enable it if cutlass backend files are touched. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160656 Approved by: https://github.com/eellison	2025-08-14 23:44:55 +00:00
henrylhtsang	98373e5ad2	[doc] AOTI debugging guide (#160430 ) Folded from https://discuss.pytorch.org/t/a-beginners-guide-to-debugging-aot-inductor-cuda-illegal-memory-access/222188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160430 Approved by: https://github.com/angelayi	2025-08-14 23:42:17 +00:00
Michael Lazos	371eacb2ae	[Dynamo][Hierarchical Compile] Refactor for tuple flattening (#158810 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158810 Approved by: https://github.com/StrongerXi	2025-08-14 22:45:44 +00:00
PyTorch MergeBot	3650989e6e	Revert "[cutlass] fix dictionary iteration error (#160552 )" This reverts commit 29d20d49f0b7f4e362e1cefdcdc4b5659969312c. Reverted https://github.com/pytorch/pytorch/pull/160552 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160552#issuecomment-3189940880))	2025-08-14 21:41:28 +00:00
Markus Hoehnerbach	3be70dc30e	[inductor] dont reuse buffers if it affects peak (#145883 ) (#159530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530 Approved by: https://github.com/eellison	2025-08-14 21:14:36 +00:00
David Berard	47a1db823d	[triton_heuristics] Optimize the triton launcher in pt2 (#160000 ) Summary: (Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent) We observed ~10us PT2-Triton launch overhead regression after pin update. Before Triton pin-update: {F1980557238} After Triton pin-update: {F1980557240} The root cause is because https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path. The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel. Note that the static_cuda_launcher.py does not require constants to be passed to the cubin launcher (`e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)`), there is no need to pass in constexprs to the generated launcher code. 
The new launcher code needs to work on three cases: - StaticallyLaunchedCudaKernel - triton.compile.CompiledKernel - AOTInductor Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0 Test Plan: Before: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.893x ``` ``` $ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00760921 1.80298 0.623282 5.25024 0.203722 19 0.00799885 4.78223 1.00226 5.8213 0.239084 average 0.00780403 3.29261 0.812769 5.53577 0.221403 ``` After: ``` buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00747067 1.92589 0.726509 4.35459 0.204205 19 0.00747823 7.36852 1.26241 6.28208 0.239278 average 0.00747445 4.6472 0.994459 5.31834 0.221741 ``` ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.985x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000 Approved by: https://github.com/jansel, https://github.com/mlazos Co-authored-by: Xu Zhao <xzhao9@meta.com>	2025-08-14 21:04:08 +00:00
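The idea behind #160000, baking constexpr values into generated launcher source instead of reordering arguments on every call, can be sketched in plain Python. This is an illustration under stated assumptions, not Inductor's actual code generation: `make_launcher` and `fake_kernel` are hypothetical names, and the generated text is far simpler than a real launcher.

```python
def make_launcher(arg_names, constexprs):
    """Generate a launcher whose call site already contains the constexpr
    values, so no per-call argument injection or reordering is needed.

    Hypothetical helper for illustration only.
    """
    runtime_args = [n for n in arg_names if n not in constexprs]
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher(kernel, {', '.join(runtime_args)}):\n"
        f"    return kernel({', '.join(call_args)})\n"
    )
    namespace = {}
    exec(src, namespace)  # one-time cost, paid when the launcher is built
    return namespace["launcher"], src


def fake_kernel(x_ptr, n, BLOCK):
    # Stand-in for a compiled kernel: just echo what it was called with.
    return (x_ptr, n, BLOCK)


launcher, src = make_launcher(["x_ptr", "n", "BLOCK"], {"BLOCK": 128})
result = launcher(fake_kernel, "buf", 1000)
print(src)
print(result)
```

The per-call path is then a direct call with positional arguments; the cost of placing the constants is paid once at generation time.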
PyTorch MergeBot	eac2d9d695	Revert "appending the pythonpath (#160219 )" This reverts commit 1d80d697a269234b47ec7ede192faf3bb9b159e3. Reverted https://github.com/pytorch/pytorch/pull/160219 on behalf of https://github.com/clee2000 due to broke inductor? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16970222746/job/48108262003) [HUD commit link](`1d80d697a2`) ([comment](https://github.com/pytorch/pytorch/pull/160219#issuecomment-3189850381))	2025-08-14 20:58:14 +00:00
Lucas Kabela	3fe19a7a0a	[Test Fix] Delete dynamo skipfile for OpenMP test_one_thread (#160562 ) Fixes #120648 During issue scrubbing I could not repro these failing tests, so reenabling them to close out the issue ### Test Original repro command: ``` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_openmp.py -v -k test_one_thread ``` Now results in ``` platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0 -- /home/lucaskabela/.conda/envs/pytorch-3.12/bin/python3.12 cachedir: .pytest_cache hypothesis profile 'default' rootdir: /home/lucaskabela/pytorch configfile: pytest.ini plugins: hypothesis-6.138.0 collected 2 items / 1 deselected / 1 selected Running 1 items in this shard test/test_openmp.py::TestOpenMP_ParallelFor::test_one_thread PASSED [3.6874s] [100%] ===================================================== 1 passed, 1 deselected in 6.07s ===================================================== ``` And: ``` PYTORCH_TEST_WITH_DYNAMO=1 python test/test_openmp.py TestOpenMP_ParallelFor.test_one_thread ``` ``` PYTORCH_TEST_WITH_DYNAMO=1 python test/test_sort_and_select.py TestSortAndSelectCPU.test_sort_overflow_cpu_int16 ``` Both result in: ``` . ---------------------------------------------------------------------- Ran 1 test in 0.003s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160562 Approved by: https://github.com/zou3519	2025-08-14 20:55:59 +00:00
Dev Sashidhar	4a90dc0c1f	Update checkpoint warning to target PyTorch 2.9 (#160643 ) Fixes #160534 Updates the warning in torch.utils.checkpoint to state that starting in PyTorch 2.9, calling checkpoint without explicitly passing use_reentrant will raise an exception. Follows the guidance from the issue discussion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160643 Approved by: https://github.com/soulitzer	2025-08-14 20:53:17 +00:00
Paul Zhang	1fc683cf17	[Inductor] Allow indexing a flexible layout for extract_input_node_reduction_ranges (#160645 ) Differential Revision: D79831747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160645 Approved by: https://github.com/eellison	2025-08-14 20:43:35 +00:00
AaronWang04	b9d7de3a09	[Inductor] addmm + activation function fusion (#158137 ) PR implements a pass in post_grad to fuse activation(add + mm) This was previously done similarly here #106912 but was reverted for performance reasons. it was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and let inductor handle the fusion. however since then cuBLAS team has made a lot of perf improvements on this, will update this post with more benchmarks but preliminary benchmark show good results perf dash board <img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" /> Relu works with both training and inference but gelu only works with inference mode due to some fundamental limitations since gelu's derivative depends on input and relu's doesnt. don't think this is fixable with the current addmm_activation API Graph module before and after this pass Relu(addmm) ``` graph(): %primals_1 : [num_users=1] = placeholder[target=primals_1] %primals_2 : [num_users=2] = placeholder[target=primals_2] %primals_3 : [num_users=2] = placeholder[target=primals_3] %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {}) %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {}) %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {}) %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {}) return (relu, primals_2, le, permute_1) graph(): %primals_1 : [num_users=1] = placeholder[target=primals_1] %primals_2 : [num_users=2] = placeholder[target=primals_2] %primals_3 : [num_users=2] = placeholder[target=primals_3] %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = 
(%primals_1, %primals_3, %primals_2), kwargs = {}) %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {}) %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {}) return (_addmm_activation_default, primals_2, le, permute_1) ``` Gelu (addmm) ``` graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %arg2_1 : [num_users=1] = placeholder[target=arg2_1] %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {}) %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {}) %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {}) %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {}) %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {}) %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {}) %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {}) %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {}) %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {}) return (mul_5,) graph(): %arg0_1 : [num_users=1] = placeholder[target=arg0_1] %arg1_1 : [num_users=1] = placeholder[target=arg1_1] %arg2_1 : [num_users=1] = placeholder[target=arg2_1] %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = 
{use_gelu: True}) return (_addmm_activation_default,) ``` Benchmark setup: NGC pytorch 25.06 container cublas version: 12.9.1.4 torch.compile ran with dynamic = False and max_autotune H100 ``` Testing with M=1024, N=1024, K=1024, dtype=bfloat16 ============================================================ Average Time per Iteration (cublas): 0.0107 ms Average Time per Iteration (torch compile): 0.0296 ms ============================================================ Testing with M=2048, N=2048, K=2048, dtype=bfloat16 ============================================================ Average Time per Iteration (cublas): 0.0262 ms Average Time per Iteration (torch compile): 0.0327 ms ============================================================ Testing with M=4096, N=4096, K=4096, dtype=bfloat16 ============================================================ Average Time per Iteration (cublas): 0.1763 ms Average Time per Iteration (torch compile): 0.2457 ms ============================================================ Testing with M=8192, N=8192, K=8192, dtype=bfloat16 ============================================================ Average Time per Iteration (cublas): 1.5280 ms Average Time per Iteration (torch compile): 1.9437 ms ``` A100 ``` ############################################################ Testing with dtype: float16 ############################################################ ============================================================ Testing with M=1024, N=1024, K=1024, dtype=float16 ============================================================ Average Time per Iteration (cublas): 0.0313 ms Average Time per Iteration (torch compile): 0.0643 ms ============================================================ Testing with M=2048, N=2048, K=2048, dtype=float16 ============================================================ Average Time per Iteration (cublas): 0.1149 ms Average Time per Iteration (torch compile): 0.1255 ms 
============================================================ Testing with M=4096, N=4096, K=4096, dtype=float16 ============================================================ Average Time per Iteration (cublas): 0.6297 ms Average Time per Iteration (torch compile): 0.7547 ms ============================================================ Testing with M=8192, N=8192, K=8192, dtype=float16 ============================================================ Average Time per Iteration (cublas): 4.3821 ms Average Time per Iteration (torch compile): 5.0740 ms ``` Script ```py import torch torch.manual_seed(0) warmup, numrun = 10, 100 sizes = [1024, 2048, 4096, 8192] dtypes = [torch.float16, torch.bfloat16, torch.float32] device = torch.device("cuda") for dtype in dtypes: dtype_name = str(dtype).split('.')[-1] print(f"\n{'#'*60}") print(f"Testing with dtype: {dtype_name}") print(f"{'#'*60}") for size in sizes: M, N, K = size, size, size print(f"\n{'='*60}") print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}") print(f"{'='*60}") A = torch.randn(M, K, device=device, dtype=dtype) B = torch.randn(K, N, device=device, dtype=dtype) C = torch.randn(M, device=device, dtype=dtype) def func1(): return torch._addmm_activation(C, A, B, use_gelu=True) def func2(): return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh") func2_compiled = torch.compile( func2, dynamic=False, options={ "force_disable_caches": True, "max_autotune": True, "max_autotune_gemm": True, "max_autotune_gemm_backends": "TRITON", "autotune_fallback_to_aten": False, } ) for _ in range(warmup): func1() torch.cuda.synchronize(device=device) start_event = torch.cuda.Event(enable_timing=True) end_event = torch.cuda.Event(enable_timing=True) total_time_ms = 0.0 start_event.record() for _ in range(numrun): func1() end_event.record() torch.cuda.synchronize(device=device) total_time_ms += start_event.elapsed_time(end_event) avg_time_ms = total_time_ms / numrun print(f"Average Time per Iteration (cublas):\t
{avg_time_ms:.4f} ms") for _ in range(warmup): func2_compiled() torch.cuda.synchronize(device=device) start_event = torch.cuda.Event(enable_timing=True) end_event = torch.cuda.Event(enable_timing=True) total_time_ms = 0.0 start_event.record() for _ in range(numrun): func2_compiled() end_event.record() torch.cuda.synchronize(device=device) total_time_ms += start_event.elapsed_time(end_event) avg_time_ms = total_time_ms / numrun print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158137 Approved by: https://github.com/eellison	2025-08-14 20:41:38 +00:00
Guilherme Leobas	1028c5e2d5	[Dynamo] Add CPython default dict tests (#155263 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155263 Approved by: https://github.com/zou3519	2025-08-14 20:22:22 +00:00
vishalgoyal316	19b4283884	Typo correction in variable name uninitalized_val in resize() function (#160636 ) Fixes #160633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160636 Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007	2025-08-14 20:11:43 +00:00
Michael Lazos	8d6d324631	[Dynamo][Hierarchical-Compile] Don't allow node duplicates to be added (#160605 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160605 Approved by: https://github.com/StrongerXi	2025-08-14 20:02:10 +00:00
Alex Malyshev	fdfd69bb05	Set PYTHONHOME for inductor subprocesses using torch (#160008 ) This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`. If they're part of an application that bundles the Python runtime, then they should use the bundled runtime to keep their view of the world consistent. There are more `sys.executable` subprocesses in torch/ but it seems like they're fine. Previous PR at https://github.com/pytorch/pytorch/pull/159382, but it was reverted because it caused macOS jobs on GitHub to time out. What was happening was that inductor subprocesses were scheduling C++ compilation tasks that failed to find the Python.h header. This was because they were running in venvs and were now trying to find the CPython headers inside the venv, where the headers do not exist. This PR gates the new behavior to internal builds only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160008 Approved by: https://github.com/aorenste	2025-08-14 19:57:14 +00:00
Logan Thomas	0d3461bac0	DOC: update CrossEntropyLoss with note and example of incorrect target specification (#155649 ) Fixes #134771 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155649 Approved by: https://github.com/mikaylagawarecki Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>	2025-08-14 18:34:57 +00:00
Howard Huang	65053c03a3	[FR] Don't check incomplete ranks for printing (#160195 ) When just printing the ranks (`-j` option) we should skip the check for "incomplete ranks" since that doesn't affect the print Pull Request resolved: https://github.com/pytorch/pytorch/pull/160195 Approved by: https://github.com/fduwjj ghstack dependencies: #160097	2025-08-14 18:19:45 +00:00
Howard Huang	96f9fbe21a	Fix flight recorder for P2P ops (#160097 ) Fixes errors in debugging a trace as mentioned in https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160097 Approved by: https://github.com/fduwjj	2025-08-14 18:19:45 +00:00
Thomas Germer	1c25871191	Allow torch.hub.load with unauthorized GITHUB_TOKEN (#159896 ) Allow torch.hub.load with unauthorized GITHUB_TOKEN `torch.hub.load` fails if a `GITHUB_TOKEN` with few permissions is set, as can be seen in the following example. Make sure that the model has not been cached before, for example with `rm ~/.cache/torch`. If the model has been downloaded already, it will not be downloaded again and the authorization error will not occur. ```python export GITHUB_TOKEN="" python >>> import torch >>> torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 567, in load repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 231, in _get_cache_or_reload _validate_not_a_forked_repo(repo_owner, repo_name, ref) File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 191, in _validate_not_a_forked_repo response = json.loads(_read_url(Request(url, headers=headers))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 174, in _read_url with urlopen(url) as r: ^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 215, in urlopen return opener.open(url, data, timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 521, in open response = meth(req, response) ^^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 630, in http_response response = self.parent.error( ^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 559, in error return self._call_chain(*args) ^^^^^^^^^^^^^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 492, in _call_chain result = func(*args)
^^^^^^^^^^^ File "~/miniconda3/lib/python3.12/urllib/request.py", line 639, in http_error_default raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 401: Unauthorized ``` The cause of the error is that the function `_validate_not_a_forked_repo` in `hub.py` always uses `GITHUB_TOKEN` for authorization, even when downloading does not require authorization. `0ba09a6d34/torch/hub.py (L194)` This fix simply retries the download without the token in case of a failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159896 Approved by: https://github.com/albanD	2025-08-14 18:15:49 +00:00
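The retry described in #159896 can be sketched as follows. This is an illustration under stated assumptions, not the code in `torch/hub.py`: `read_with_token_fallback`, `fetch`, and `fake_fetch` are hypothetical names, the `Authorization` header format is assumed, and the fake server only simulates a rejected token.

```python
import urllib.error


def read_with_token_fallback(fetch, url, token):
    """Call fetch(url, headers), retrying once without the token on HTTPError.

    `fetch` is a hypothetical callable standing in for the real HTTP request.
    """
    if not token:
        return fetch(url, {})
    try:
        return fetch(url, {"Authorization": f"token {token}"})
    except urllib.error.HTTPError:
        # The token may be invalid or under-privileged; public data can
        # still be readable anonymously, so retry without it.
        return fetch(url, {})


def fake_fetch(url, headers):
    # Stand-in server: rejects any token, serves anonymous requests.
    if "Authorization" in headers:
        raise urllib.error.HTTPError(url, 401, "Unauthorized", None, None)
    return '{"ok": true}'


body = read_with_token_fallback(fake_fetch, "https://example.invalid/repo", "bad-token")
print(body)
```

If the anonymous retry also fails, the second `HTTPError` propagates to the caller unchanged, so genuine failures are still reported.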
Xilun Wu	6c05ea6475	[DTensor] add op support: aten.squeeze_.dim (#159532 ) Summary This PR enables in-place op `aten.squeeze_.dim` on DTensor with a change to DTensor dispatch logic: when processing in-place operator, we should assign `output_sharding.output_spec` back to the first argument. This is because the in-place op_call on `arg._local_tensor` could also shift the tensor meta. Test `pytest test/distributed/tensor/test_view_ops.py -s -k test_squeeze_` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159532 Approved by: https://github.com/zpcore	2025-08-14 18:01:19 +00:00
Howard Huang	5665dc9ab7	[PP] Allow larger world_size schedule tests (#160559 ) Update schedule tests to use `world_size=4`, changes needed: - Move some tests that require world_size=2 to new class - Move helper methods from class level to function level - Update some initialization to pass assert since gradients were super small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160559 Approved by: https://github.com/wconstab ghstack dependencies: #159591, #160558	2025-08-14 17:41:58 +00:00
Howard Huang	2ff7c1c774	[PP] Rename _load_actions and validate (#160558 ) Rename method and add validation Pull Request resolved: https://github.com/pytorch/pytorch/pull/160558 Approved by: https://github.com/wconstab ghstack dependencies: #159591	2025-08-14 17:41:58 +00:00
Guilherme Leobas	3028fa6ce9	Wrap class definitions in `set_fullgraph(False)` in `test_list`/`tuple` (#160277 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160277 Approved by: https://github.com/zou3519 ghstack dependencies: #160216, #160217, #160276, #160278, #160330, #160331	2025-08-14 17:29:45 +00:00
Matthew Haddock	077cb38974	Add dtype checks in meta dispatch for various ordering ops (#159556 ) This adds data type checks for the unsupported bool and complex types for argmax/argmin, topk, sort, minimum, and maximum, as listed here: `0a99b026d6/torch/testing/_internal/common_methods_invocations.py (L21076)` Currently these ops fail during the CPU or CUDA computation rather than at the meta dispatch stage, as happens for, e.g., max: `0a99b026d6/aten/src/ATen/native/TensorCompare.cpp (L285)` . The new checks catch the error early. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159556 Approved by: https://github.com/janeyx99	2025-08-14 17:06:27 +00:00
Jovian Anthony Jaison	cd8d8c18f5	[pytorch][dynamo_compile] Log graph_node_shape to dynamo_compile (#160556 ) This PR adds the dynamo graph node shape logging to dynamo compile. Also added unit tests to check if correct graph node shape is being logged. Test Plan: $ python -m test_utils Ran 12 tests in 36.447s OK Note: Will merge after D80185628 lands. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160556 Approved by: https://github.com/masnesral, https://github.com/jingsh	2025-08-14 16:42:35 +00:00
Lucas Kabela	63654ba4c5	[BE][Dynamo] Type improvements in `_dynamo/utils` to generics (#159824 ) Follow up to #159580 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824 Approved by: https://github.com/williamwen42	2025-08-14 16:06:50 +00:00
Ke Wen	7e27347fd3	[SymmMem] Check return of nvshmem_malloc (#160603 ) `nvshmem_malloc` returns a null pointer when allocation fails. We should check here. Otherwise, the nullptr can go down the road and into the device kernel, causing CUDA illegal memory access. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160603 Approved by: https://github.com/fduwjj, https://github.com/ngimel	2025-08-14 15:57:55 +00:00
Raman Kumar	1d80d697a2	appending the pythonpath (#160219 ) Fixes #160193 `PYTHONPATH=/torchbench` to `PYTHONPATH=/torchbench:$PYTHONPATH` in [pytorch/.ci/pytorch/test.sh](`b5fd7223b1/.ci/pytorch/test.sh (L1715)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160219 Approved by: https://github.com/malfet	2025-08-14 15:55:31 +00:00
Xinya Zhang	b6b74aed60	[ROCm] Support large inputs for coalesceValuesKernel (#158281 ) # Description `.coalesce` cannot handle large inputs on ROCM due to maximal grid size limit. This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation. Confirmed the new approach can handle large inputs. Correctness needs validation. # Testing Command `python torch_spmv.py 22500000 272500000` ## Script `torch_spmv.py` ``` python import torch import argparse def parse_args(): parser = argparse.ArgumentParser( description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch" ) parser.add_argument("n", type=int, help="Size of the NxN matrix") parser.add_argument("nnz", type=int, help="Number of non-zero entries") return parser.parse_args() def main(): args = parse_args() n = args.n nnz = args.nnz dtype = torch.float32 device = torch.device('cuda') # Generate random indices for the sparse matrix in COO format. torch.manual_seed(42) rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device) cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device) indices = torch.stack([rows, cols], dim=0) # Generate random values. values = torch.randn(nnz, dtype=torch.float32, device=device) # Create the sparse COO matrix and move it to the target device. sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device) sparse_matrix = sparse_matrix.coalesce() # Generate a random dense vector. dense_vector = torch.randn(n, dtype=torch.float32, device=device) # Perform sparse matrix - dense vector multiplication. # Using torch.sparse.mm which expects a 2D tensor for the vector. result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze() # result = torch.mv(sparse_matrix, dense_vector) # Print the result. 
print("Result of the multiplication:") print(torch.sum(result)) if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281 Approved by: https://github.com/jeffdaily	2025-08-14 15:09:16 +00:00
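The axis split described in the commit above can be illustrated with plain index arithmetic. This is a minimal sketch of the general technique, not the kernel code from the PR: `split_grid`, `linear_block`, and the `max_dim` parameter are hypothetical names, and the actual per-axis limit depends on the device and runtime.

```python
import math


def split_grid(num_blocks: int, max_dim: int) -> tuple[int, int]:
    """Split a 1D block count into (grid_x, grid_y), each axis at most max_dim."""
    if num_blocks <= 0 or max_dim <= 0:
        raise ValueError("num_blocks and max_dim must be positive")
    grid_x = min(num_blocks, max_dim)
    grid_y = math.ceil(num_blocks / grid_x)
    if grid_y > max_dim:
        raise ValueError("block count does not fit in two axes")
    return grid_x, grid_y


def linear_block(x: int, y: int, grid_x: int) -> int:
    """Recover the original 1D block index from the 2D block coordinates."""
    return y * grid_x + x
```

Because `grid_x * grid_y` can exceed `num_blocks`, a kernel using this layout would still need a bounds check such as `linear_block(x, y, grid_x) < num_blocks` before doing any work.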
Tugsbayasgalan Manlaibaatar	4a773e1e86	Warn when there is side effect in strict mode (#160060 ) Differential Revision: [D79784354](https://our.internmc.facebook.com/intern/diff/D79784354) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160060 Approved by: https://github.com/zhxchen17, https://github.com/StrongerXi	2025-08-14 14:59:44 +00:00
Howard Huang	198b5fd2d4	[PP] Add DualPipeV schedule (#159591 ) Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11 <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" /> This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs. Tests: ```sh python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules ``` Also tested in TorchTitan and is running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591 Approved by: https://github.com/wconstab	2025-08-14 14:58:35 +00:00
blaine-rister	20bdabbb3c	[Dynamo] Fix MTIA dynamo backend by avoiding has_triton() at import time (#160604 ) # Summary MTIA's torch.compile tests were broken by D80037015. (For details, see internal task T234563969.) The root cause was that `has_triton` can change state after we call `torch.mtia.init()`, but it was used in a way that fixes Inductor's behavior at import time. (Note that `has_triton` is cached, and there's no opportunity to call `torch.mtia.init()` prior to `import torch`.) To fix this, we use `try: import triton` as opposed to `has_triton()` at the module level. # Test Plan See the internal diff. As a follow-up, we will add appropriate unit tests and/or CI hints so this type of issue can be caught at PR/diff time. Differential Revision: D80228000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160604 Approved by: https://github.com/PaulZhang12, https://github.com/eellison	2025-08-14 14:54:49 +00:00
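The caching hazard described in the commit above can be reproduced without PyTorch. This is a minimal sketch under stated assumptions: `_backend_ready`, `cached_probe`, and `optional_module_importable` are hypothetical names that stand in for the state changed by an init call, a cached availability check, and a plain import attempt. None of them is Inductor code.

```python
import functools

# Hypothetical stand-in for runtime state that a later init call could change.
_backend_ready = False


@functools.lru_cache(maxsize=None)
def cached_probe() -> bool:
    """Cached check: the first answer is frozen for the life of the process."""
    return _backend_ready


def optional_module_importable(name: str) -> bool:
    """Uncached check that simply attempts the import."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


before = cached_probe()   # evaluated while the backend is not ready
_backend_ready = True     # state changes after the first call
after = cached_probe()    # the cache still returns the earlier answer
```

A cached probe evaluated at import time therefore pins whatever was true at that moment, whereas an import attempt depends only on whether the module can be loaded.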
Alexander Grund	d556586448	[cutlass backend] re-add pip cutlass path (#160180 ) Revert #156651 to allow using the cutlass PIP package which is easier for users than the Git checkout or similar method. Also fix a bug where the PIP cutlass path wouldn't be available to subprocesses spawned during benchmarking for algorithm selection. Looks like the "spawn" method does not inherit the (potentially) already set up `config.cuda.cutlass_dir` so in the subprocess the include paths will still be set to `"../third_party/cutlass/"` leading to compilation failure due to missing headers. Ensure `try_import_cutlass` is called at that point, which due to caching is a no-op in most cases, so doesn't hurt. Change the logic to return `None` when cutlass isn't available returning more useful values for include paths, namely an empty list. This is in line with other inductor code which disables the CUTLASS backend when `try_import_cutlass` returns False Pull Request resolved: https://github.com/pytorch/pytorch/pull/160180 Approved by: https://github.com/henrylhtsang, https://github.com/mlazos	2025-08-14 14:48:31 +00:00
Isuru Fernando	781e9a7724	Fix meta for constant_pad_nd (#159878 ) Fixes https://github.com/pytorch/pytorch/issues/144187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159878 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-08-14 14:47:47 +00:00
atalman	e4de93f6a3	Add sm50 and sm60 back to windows builds (#160586 ) Addresses the issue reported in https://github.com/pytorch/pytorch/issues/160575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160586 Approved by: https://github.com/malfet	2025-08-14 12:46:35 +00:00
Wang, Chuanqi	a5652407e4	[CI] Fix triton xpu build on Windows (#160442 ) Pin the ninja version to 1.11 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160442 Approved by: https://github.com/atalman	2025-08-14 12:43:49 +00:00
Laith Sakka	6f0f4e0c3e	reduce threshold to suggest changes to expected results (#160463 ) Since we increased the threshold to 10%, I would like suggestions to update expected results to show up at +-2% instead of the current 3.3%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160463 Approved by: https://github.com/jamesjwu	2025-08-14 09:11:27 +00:00
fengqing.lu	db763b1717	[Intel GPU] Support SDPA backend selection and priority setting on XPU (#159464 ) Currently, SDPA on XPU uses its own `priority_order` instead of the one from the global context. Hence it does not support `with sdpa_kernel(order, set_priority=True)`. This PR enables this feature. To make the default `priority_order` from the global context work for XPU, I also move the MATH backend to the lowest priority; otherwise `cudnn attention` and `overrideable attention` will never be selected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464 Approved by: https://github.com/guangyey, https://github.com/drisspg Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Co-authored-by: mayuyuace <qiming1.zhang@intel.com>	2025-08-14 08:55:31 +00:00
Phil Xiaojun Hu	089c4a1ba0	Fix wrong log file name in the docs of `torch.distributed.elastic.multiprocessing.start_processes()` (#160396 ) Fixes #160395 In https://docs.pytorch.org/docs/stable/elastic/multiprocessing.html#starting-multiple-workers and also in the code comment of the function[1], it was specified that: ``` For each process, the ``log_dir`` will contain: #. ``{local_rank}/error.json``: if the process failed, a file with the error info #. ``{local_rank}/stdout.json``: if ``redirect & STDOUT == STDOUT`` #. ``{local_rank}/stderr.json``: if ``redirect & STDERR == STDERR`` ``` While in code[2], the files are `stdout.log` and `stderr.log`, instead of the `.json` ones listed in the doc. [1]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/__init__.py#L144-L145 [2]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L354-L357 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160396 Approved by: https://github.com/fduwjj	2025-08-14 08:24:07 +00:00
zpcore	97c8c98f8d	measure dispatch overhead (#160504 ) Reopen https://github.com/pytorch/pytorch/pull/159699 to merge to main. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160504 Approved by: https://github.com/wconstab	2025-08-14 06:13:53 +00:00
FFFrog	39aa3d1471	Remove the dead code in setup.py (#160515 ) The following line has no effect. `34ec5ed275/setup.py (L1205)` This code was originally introduced in this PR: `dd7cec680c`, and clang11 and later now support `-fstack-clash-protection`. Can we remove this line? @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/160515 Approved by: https://github.com/isuruf, https://github.com/albanD	2025-08-14 06:02:11 +00:00
Yang Wang	639778b3ee	[2/3 step][ vllm ci build setup] Add vllm build logic and dockerfile (#160089 ) # set up vllm build logic - dockerfile: please note that the dockerfile introduced here is only temporary; once we migrate this file to vllm, we will fetch it directly from there - VllmBuildRunner: - implement logic to prepare and run vllm build with dockerfile Pull Request resolved: https://github.com/pytorch/pytorch/pull/160089 Approved by: https://github.com/huydhn ghstack dependencies: #160043	2025-08-14 05:51:45 +00:00
Yang Wang	00d7d6f123	[1/3][ghstack] [vllm ci build setup ]setup lumen_cli (#160043 ) # Description set up torch_cli using argparse ## Details: - add vllm placeholder in the cli - add unittest for cli command see Readme.md to see how to run the cli Pull Request resolved: https://github.com/pytorch/pytorch/pull/160043 Approved by: https://github.com/huydhn	2025-08-14 05:51:45 +00:00
Jeff Daily	c6d78d4dbd	[ROCm] enable miopen channels last 3d for conv and batchnorm (#160529 ) miopen batchnorm for channels last is guarded by env var PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM similar to existing PYTORCH_MIOPEN_SUGGEST_NHWC for conv. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160529 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-14 05:30:19 +00:00
Boyuan Feng	2898d3f965	[Lowering] Add assertion msg to sym_size and sym_stride (#160591 ) Summary: Add assertion msg to sym_size and sym_stride lowering function. Test Plan: Will test in mast job. Rollback Plan: Differential Revision: D80187693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160591 Approved by: https://github.com/angelayi	2025-08-14 04:55:32 +00:00
PyTorch UpdateBot	34358f335d	[vllm hash update] update the pinned vllm hash (#160594 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160594 Approved by: https://github.com/pytorchbot	2025-08-14 04:21:28 +00:00
zeshengzong	fe3f5fe4ea	Optimize `min`, `max` gradient behavior description (#160312 ) Fixes #160273 ## Test Result <img width="897" height="593" alt="image" src="https://github.com/user-attachments/assets/6ebcdb2c-8a2c-4f0d-8195-656089e88325" /> <img width="985" height="653" alt="image" src="https://github.com/user-attachments/assets/606a7264-e223-4d2b-8c3f-f153ce43b208" /> <img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/0ae2f56f-820f-4194-b15c-a02a078c0487" /> <img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/79c38a17-45ac-4808-829f-d538178de36b" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160312 Approved by: https://github.com/ngimel	2025-08-14 04:18:49 +00:00
Aidyn-A	45ba7ecda8	Flex Attention heuristics: a Blackwell config (#160192 ) Fixes #160074 and more. This is the working config for B200 and RTX 5080. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160192 Approved by: https://github.com/drisspg	2025-08-14 03:47:02 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	194fcfcfbd	Add support for param mutation under inference mode (#159661 ) Summary: In HF model rwkv, we have parameter mutation under inference mode which should be safe. This PR does multiple things to make sure it works: 1. We execute global autograd mutation while tracing so that we can actually trace through parameter inplace mutation 2. Add support for parameter mutation under inference mode in AOTAutograd 3. Add support for parameter mutation under inference mode in export. Test Plan: test Rollback Plan: Differential Revision: D79460136 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159661 Approved by: https://github.com/ydwu4	2025-08-14 03:34:04 +00:00
Michael Lazos	29d20d49f0	[cutlass] fix dictionary iteration error (#160552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160552 Approved by: https://github.com/henrylhtsang, https://github.com/jingsh	2025-08-14 03:23:46 +00:00
Guilherme Leobas	3faee0a631	Update nullcontext to return input args (#158776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158776 Approved by: https://github.com/zou3519	2025-08-14 03:02:44 +00:00
Yu, Guangye	8cfaf51d4e	Generalize support of background thread in pinned allocator (#160505 ) # Motivation https://github.com/pytorch/pytorch/pull/135524 only introduces support for a background thread for CUDA; this PR intends to support it for other backends, such as XPU, as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160505 Approved by: https://github.com/albanD	2025-08-14 02:22:39 +00:00
Guilherme Leobas	af3cabc55d	Wrap class definitions in `set_fullgraph(False)` in `test_sort` (#160331 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160331 Approved by: https://github.com/zou3519 ghstack dependencies: #160216, #160217, #160276, #160278, #160330	2025-08-14 02:12:20 +00:00
Guilherme Leobas	74bbe7b4a3	Wrap class definitions in `set_fullgraph(False)` in `test_math`/`cmath` (#160330 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160330 Approved by: https://github.com/zou3519 ghstack dependencies: #160216, #160217, #160276, #160278	2025-08-14 02:12:20 +00:00
Guilherme Leobas	7bfc424a61	Wrap class definitions in `set_fullgraph(False)` in `test_iter` (#160278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160278 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #160216, #160217, #160276	2025-08-14 02:12:20 +00:00
RajeshvShiyal	5ace061254	finfo eps doc fix (#160502 ) Existing documentation for torch.finfo().eps is as below: \| eps \| float \| The smallest representable number such that ``1.0 + eps != 1.0``. \| Proposed documentation for torch.finfo().eps is as below: \| eps \| float \| The difference between 1.0 and the next smallest representable float larger than 1.0. \| Fixes #160397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502 Approved by: https://github.com/ngimel	2025-08-14 01:49:35 +00:00
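The difference between the two descriptions of `eps` in the commit above can be checked numerically. This is a small sketch that uses Python's built-in `float` (an IEEE 754 double on common platforms) rather than `torch.finfo`, so that it runs without PyTorch; treating the two as equivalent for float64 is an assumption of this example.

```python
import math
import sys

eps = sys.float_info.epsilon  # machine epsilon for Python's float

# eps equals the gap between 1.0 and the next representable float above it.
gap = math.nextafter(1.0, 2.0) - 1.0

# A value smaller than eps can still change 1.0 after rounding to nearest,
# so eps is not "the smallest number such that 1.0 + eps != 1.0".
smaller = 0.75 * eps
changes_one = (1.0 + smaller) != 1.0
```

Here `1.0 + smaller` rounds up to the next float above 1.0, which is why the older wording was inaccurate.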
drisspg	15e49f6164	Factor out the strings to templates for better editor integration (#160357 ) # Summary More code motion. TL;DR: install 'Better Jinja' in VS Code and you now get highlighting. Before: <img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" /> After: <img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357 Approved by: https://github.com/eellison	2025-08-14 01:07:53 +00:00
Laith Sakka	dd21c8a578	refresh expected results (#160537 ) Regression introduced by https://github.com/pytorch/pytorch/pull/160314. Not much worried about it since it did not affect other inductor benchmarks; could not repro locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160537 Approved by: https://github.com/eellison	2025-08-14 00:56:14 +00:00
Nikita Shulga	a06ec54d40	[MPS] Add API to query GPU core count (#160414 ) Use good old IOKit to get the `gpu-core-count` property from the device implementing the `AGXAccelerator` service. Expose it as `torch.backends.mps.get_core_count()` and make it accessible to the inductor via `MpsInterface`. Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType\|head -n10` ``` % python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())" Apple M1 Pro 16 % system_profiler SPDisplaysDataType\|head -n10 Graphics/Displays: Apple M1 Pro: Chipset Model: Apple M1 Pro Type: GPU Bus: Built-In Total Number of Cores: 16 Vendor: Apple (0x106b) Metal Support: Metal 3 ``` This would significantly improve occupancy for torch.compile generated kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414 Approved by: https://github.com/dcci	2025-08-14 00:05:17 +00:00
Mikayla Gawarecki	50a8c11875	Add getCurrentDeviceIndex to torch::stable::accelerator (#160453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160453 Approved by: https://github.com/janeyx99 ghstack dependencies: #159679	2025-08-13 23:42:24 +00:00
Mikayla Gawarecki	e4e4dbd2f8	Add beginnings of torch::stable::accelerator (#159679 ) Adds - `torch::stable::accelerator::DeviceGuard`: `std::unique_ptr` to `DeviceGuardOpaque` mostly copied from the below (but made generic) `50eac811a6/torch/csrc/inductor/aoti_runtime/utils_cuda.h (L30-L46)` - constructor `DeviceGuard(DeviceIndex)` (this matches aoti but differs from the actual c10 DeviceGuard constructor that takes in device) - `set_index(DeviceIndex)` - `torch::stable::accelerator::Stream`: `std::shared_ptr` to `StreamOpaque` - constructor `Stream(StreamHandle stream)` (similar to torch::stable::Tensor) - `id() -> StreamId` - `getCurrentStream(DeviceIndex device_index) -> stable::accelerator::Stream` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159679 Approved by: https://github.com/guangyey, https://github.com/janeyx99	2025-08-13 23:42:24 +00:00
Aidyn-A	d670304001	[ATen][CUDA] Use new CCCL API in v2.8 (#160554 ) Silences deprecation warnings like: ``` In file included from tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:1: /tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c: At global scope: /tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:243:219: warning: 'template<class ValueType, class OffsetT> class at_cuda_detail::cub::CountingInputIterator' is deprecated: Use thrust::counting_iterator instead [-Wdeprecated-declarations] 243 \| static void __device_stub__ZN2at6native43_GLOBAL__N__3cee4041_10_Nonzero_cu_cba1aaa011flag_kernelILi512ELi16EhEEvPKT1_PlPKllli( const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE __par0, int64_t __par1, const int64_t __par2, int64_t __par3, int64_t __par4, int __par5) { __cudaLaunchPrologue(6); __cudaSetupArgSimple(__par0, 0UL); __cudaSetupArgSimple(__par1, 8UL); __cudaSetupArgSimple(__par2, 16UL); __cudaSetupArgSimple(__par3, 24UL); __cudaSetupArgSimple(__par4, 32UL); __cudaSetupArgSimple(__par5, 40UL); __cudaLaunch(((char )((void ( )(const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE , int64_t , const int64_t , int64_t, int64_t, int))at::native::_NV_ANON_NAMESPACE::flag_kernel<(int)512, (int)16, unsigned char> ))); }namespace at{ \| ^~~~~~~~~~~~~~~~~~~~~ /usr/local/cuda-12.9/include/cub/iterator/counting_input_iterator.cuh:93:63: note: declared here 93 \| class CCCL_DEPRECATED_BECAUSE("Use thrust::counting_iterator instead") CountingInputIterator \| ^~~~~~~~~~~~~~~~~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160554 Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman	2025-08-13 23:15:53 +00:00
Sheng Fu	c5efc5c8a6	Fix unit test test_equivalent_template_code (#160432 ) Summary: Fix unit test test_equivalent_template_code https://github.com/pytorch/pytorch/pull/159920 treats ReinterpretView as a not-realized node when searching FX origin nodes for fused triton kernel. In test_equivalent_template_code, there is a transpose node (which is a ReinterpretView) before matmul. It was not in FX graph segment before PR 159920. FX origin nodes are used to define the name of triton kernel. That is the reason test_equivalent_template_code failed with PR 159920 since it uses hard-coded triton kernel name to check the result. The fix is to update the triton kernel name in the unit test. Test Plan: buck2 run mode/opt caffe2/test/inductor:benchmark_fusion -- caffe2.test.inductor.test_benchmark_fusion.BenchmarkMultiTemplateFusionCudaTest Rollback Plan: Differential Revision: D80101711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160432 Approved by: https://github.com/clee2000	2025-08-13 23:14:51 +00:00
Will Constable	6da11d9aaf	[C10D] Add check_rng_sync util (#160283 ) Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected. Notes: - used allgather instead of gather since its simpler to do this SPMD rather than add conditional behavior, though I could be convinced we only want to log on rank0. Usage: `check_rng_sync(generator, group)` Prints something like this: (cuda): ``` [rank0]:E0808 ] Generator desync detected: [rank0]:E0808 ] Ranks (Seed, Offset) values [rank0]:E0808 ] ------- ----------------------- [rank0]:E0808 ] 0 (456, 0) [rank0]:E0808 ] 1 (123, 4) [rank0]:E0808 ] 2-3 (123, 0) ``` (cpu): ``` [rank2]:E0810 ] Generator desync detected: [rank2]:E0810 ] Ranks Generator State Hash values [rank2]:E0810 ] ------- ----------------------------- [rank2]:E0810 ] 0 7633364531954955665 [rank2]:E0810 ] 1 8807615394212033278 [rank2]:E0810 ] 2-3 -6150027303226666531 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283 Approved by: https://github.com/ezyang	2025-08-13 23:05:29 +00:00
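The compact `2-3` row in the sample output above suggests that ranks sharing a state are merged into ranges. The following is a sketch of that kind of summarization only, not the implementation of `check_rng_sync`: `summarize_ranks` is a hypothetical helper, and it assumes the per-rank states have already been gathered into a dict keyed by rank.

```python
from itertools import groupby


def summarize_ranks(states: dict) -> list:
    """Group adjacent ranks that share a state into (rank_range, state) rows."""
    rows = []
    ranks = sorted(states)
    # groupby merges runs of neighbouring ranks whose state compares equal.
    for state, run in groupby(ranks, key=lambda r: states[r]):
        run = list(run)
        label = str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}"
        rows.append((label, state))
    return rows


# Values taken from the CUDA example in the commit message: (seed, offset).
rows = summarize_ranks({0: (456, 0), 1: (123, 4), 2: (123, 0), 3: (123, 0)})
```

In this sketch, ranks with equal state that are not adjacent end up in separate rows, which keeps every label a contiguous range.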
Markus Hoehnerbach	182efe31db	[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160 ) (#158462 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462 Approved by: https://github.com/eellison	2025-08-13 22:54:18 +00:00
William Wen	1ea688f9a2	[dynamo] fix EXTENDED_ARG starts_line dropping bug (#160478 ) Fixes https://github.com/pytorch/pytorch/issues/160471 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160478 Approved by: https://github.com/Lucaskabela, https://github.com/billmguo	2025-08-13 22:27:40 +00:00
Isabella Ni	53e3949495	[MTIA-T][CFF] Pass backend parameter into GPU vertical pass file and pattern matcher (#160404 ) Summary: As titled Please see https://fb.workplace.com/groups/1075192433118967/posts/1735215827116621/?comment_id=1735220747116129&reply_comment_id=1735242997113904 Basically, for MTIA, we want mtia_afg to show up in the counters and backend, instead of Inductor. MTIA is not using inductor yet. Using env var TORCHINDUCTOR_PATTERN_MATCH_BACKEND to pass in the actual backend. The env var default value is "inductor", so nothing should break for GPU. Test Plan: Default is always "inductor", so existing test should not break. CI tests Rollback Plan: Differential Revision: D80069072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160404 Approved by: https://github.com/BoyuanFeng	2025-08-13 22:24:27 +00:00
PyTorch MergeBot	33d9401866	Revert "[BE][Dynamo] Type improvements in `_dynamo/utils` to generics (#159824 )" This reverts commit 3ef2e1ef769582a82c6ddf150e9d11bf4bf1c44f. Reverted https://github.com/pytorch/pytorch/pull/159824 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_trace_rules.py::TraceRuleTests::test_almost_impossible_missing_name [GH job link](https://github.com/pytorch/pytorch/actions/runs/16948305999/job/48035192324) [HUD commit link](`3ef2e1ef76`) ([comment](https://github.com/pytorch/pytorch/pull/159824#issuecomment-3186003531))	2025-08-13 22:17:29 +00:00
Shangdi Yu	d1950d4bb5	Change IR node's stack trace to be computed lazily (#160487 ) Summary: When an IR node is an inherited class, post_init is called once for each super().__init__() call. To avoid duplicated calls, we make stack trace computation happen lazily. Test Plan: CI Rollback Plan: Differential Revision: D80137870 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160487 Approved by: https://github.com/angelayi	2025-08-13 21:41:25 +00:00
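Deferring an expensive attribute until first access, as the commit above describes, is commonly done in Python with `functools.cached_property`. This is a minimal sketch of that general pattern; `Node` and its `captures` counter are hypothetical and are not Inductor's IR classes, and the PR may implement the laziness differently.

```python
import functools
import traceback


class Node:
    """Hypothetical node type used only to illustrate lazy computation."""

    captures = 0  # counts how often a trace is actually formatted

    @functools.cached_property
    def stack_trace(self) -> str:
        # Runs only on first access; the result is then stored on the instance.
        Node.captures += 1
        return "".join(traceback.format_stack())


node = Node()                 # construction does no trace work
first = node.stack_trace      # computed here, once
second = node.stack_trace     # served from the cached value
```

One caveat of this simplified version is that the trace reflects the stack at first access rather than at construction, so it only demonstrates the deferral, not which frames should be recorded.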
Mikayla Gawarecki	1196bb1c2e	Add utility to get computed kernel in torch.library (#158393 ) Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key - Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered). - `SafeKernelFunction` can be called via `callBoxed`, the validity of the token will be checked before this happens - `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel` Related to https://github.com/pytorch/pytorch/issues/155330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393 Approved by: https://github.com/albanD	2025-08-13 21:00:59 +00:00
henrylhtsang	e9eb2096a5	[cutlass backend] Allow bmm use cases when batch stride is 0 (#160356 ) Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/) The motivation for the original change was to reduce the number of parameters we pass into the kernel, which was motivated by aesthetic reasons only. But seeing the need to use different batch strides, we should just pass in the batch stride. That would be a good long term fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160356 Approved by: https://github.com/mlazos	2025-08-13 20:52:24 +00:00
Lucas Kabela	3ef2e1ef76	[BE][Dynamo] Type improvements in `_dynamo/utils` to generics (#159824 ) Follow up to #159580 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824 Approved by: https://github.com/williamwen42	2025-08-13 20:17:01 +00:00
Jithun Nair	4cde0acc0e	Make triton build ROCm library version-agnostic (#158408 ) Fixes maintenance of triton packaging script when library versions change from one ROCm version to next. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158408 Approved by: https://github.com/jeffdaily Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>	2025-08-13 19:49:23 +00:00
Jerry Mannil	70ccdec44b	[ROCm] Improve reduction sum performance (#160466 ) * Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128 Reproducer: ``` import time import torch shapes = [ (5079670, 128) ] dims = [ (1) ] for i, shape in enumerate(shapes): x = torch.randn(shape, device='cuda', dtype=torch.float) for _ in range(10): w = torch.sum(x, dims[i]) torch.cuda.synchronize() print(w.size()) start_time = time.time() for _ in range(50): _ = torch.sum(x, dims[i]) torch.cuda.synchronize() end_time = time.time() mean_time = (end_time - start_time)/50 print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us") ``` Before (MI300X): Avg time for shape (5079670, 128): 1629.99 us After (MI300X) Avg time for shape (5079670, 128): 1008.59 us Pull Request resolved: https://github.com/pytorch/pytorch/pull/160466 Approved by: https://github.com/petrex, https://github.com/jeffdaily	2025-08-13 18:46:58 +00:00
Nikita Shulga	db0b7f1cc9	[BE][CI] Adjust `error_inputs` for cat and complex (#160378 ) MPS backend does not support double, so errors should be different Pull Request resolved: https://github.com/pytorch/pytorch/pull/160378 Approved by: https://github.com/dcci	2025-08-13 18:35:06 +00:00
ILCSFNO	1c26c53851	Fix the Doc of `pivot` in `torch.lu` (#159617 ) Fixes #159616 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159617 Approved by: https://github.com/lezcano, https://github.com/jansel	2025-08-13 18:30:54 +00:00
Alexander Grund	adcca7d9a1	Do not rpath CUDA stubs folder in JIT generated code (#160179 ) `_transform_cuda_paths` intentionally includes the CUDA stubs folder. However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with > CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library" This results in e.g. non-descriptive errors like ``` cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67 cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096 terminate called after throwing an instance of 'cutlass::cuda_exception' what(): std::exception ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160179 Approved by: https://github.com/jansel	2025-08-13 18:29:24 +00:00
Dmitry Nikolaev	01584d2a7d	[ROCm] remove extra transposes in NHWC convolutions on MIOpen (#160435 ) remove aten::contiguous for NHWC convolutions on ROCm Tests: - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 Before: <img width="1255" height="228" alt="image" src="https://github.com/user-attachments/assets/b125ccab-00c2-4d3a-a341-4583e51d8d57" /> After: <img width="874" height="153" alt="image" src="https://github.com/user-attachments/assets/ec200754-3622-488e-8762-bff1c2d22818" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160435 Approved by: https://github.com/jeffdaily	2025-08-13 17:58:22 +00:00
ILCSFNO	87e6c4079d	Fix the Doc issue on the description of edge_order in torch.gradient() (#159130 ) Fixes #159129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159130 Approved by: https://github.com/soulitzer	2025-08-13 16:48:47 +00:00
Nikita Shulga	7d87e358ac	Fix MPS conv3d autocast bias dtype mismatch (#160423 ) ## Summary - register conv3d with MPS autocast to ensure bias dtypes match under AMP - add regression test chaining two Conv3d layers on MPS autocast Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d Pull Request resolved: https://github.com/pytorch/pytorch/pull/160423 Approved by: https://github.com/dcci	2025-08-13 16:23:21 +00:00
Saurabh Mishra	6ee175195a	[DCP][OSS] Rank local checkpointing in DCP without collectives (#147758 ) Summary: DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing which basically saves and loads the checkpoint without any collective. The trade off for now is the dedupe and re-sharding. Support for these would be introduced soon. Differential Revision: D70112642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758 Approved by: https://github.com/meetv18	2025-08-13 16:20:28 +00:00
zhangfei	db32b60662	[ci] Add riscv opt-int build (#143979 ) Hi, @malfet Based on the previous discussion: [RISCV CI support · Issue #141550 · pytorch/pytorch](https://github.com/pytorch/pytorch/issues/141550) I have cross-compiled PyTorch for the RISC-V architecture on x86_64 Ubuntu 24.04 and created a new PR for it. Could you please help review it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/143979 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-13 16:12:02 +00:00
Paul Zhang	56c828bef9	Followup of #160002 , gracefully fail if Triton functions don't contain attributes (#160436 ) Summary: Fixes internal test failures of D80037015 Test Plan: CI Rollback Plan: Differential Revision: D80094187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160436 Approved by: https://github.com/clee2000	2025-08-13 16:04:56 +00:00
Natalia Gimelshein	a2fd106d67	guard cuMulticastUnbind call (#160499 ) Fixes builds for old compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/160499 Approved by: https://github.com/Skylion007	2025-08-13 15:45:51 +00:00
PyTorch MergeBot	c656334120	Revert "Factor out the strings to templates for better editor integration (#160357 )" This reverts commit cbffde774557752cf20447d42d99ec6102673c31. Reverted https://github.com/pytorch/pytorch/pull/160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](https://github.com/pytorch/pytorch/pull/160357#issuecomment-3184435581))	2025-08-13 15:40:50 +00:00
fduwjj	31c9ac4319	[c10d] Fix test test_nccl_user_buffer_registration (#160497 ) Fixed `test_nccl_user_buffer_registration`, which broke due to https://github.com/pytorch/pytorch/pull/160145; somehow CI didn't catch it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160497 Approved by: https://github.com/ngimel	2025-08-13 15:29:41 +00:00
Catherine Lee	deea71a90e	[ez][CI] Set timeout for linux-jammy-py3_13-clang12-test from 600min -> default val of 240 (#160500 ) 10 hours is very long Pull Request resolved: https://github.com/pytorch/pytorch/pull/160500 Approved by: https://github.com/huydhn	2025-08-13 15:14:24 +00:00
Svetlana Karslioglu	114a6c4043	Add placeholder for the User Guide (#159379 ) - Add pytorch_overview.md - Add pytorch_main_components.md - Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials - Move notes under user guide Pull Request resolved: https://github.com/pytorch/pytorch/pull/159379 Approved by: https://github.com/albanD Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-13 14:56:04 +00:00
libohao	ee1b0412b9	[1/N]Port 3 distributed/_tools test cases to Intel GPU (#159543 ) For [#114850](https://github.com/pytorch/pytorch/issues/114850), we will port distributed tests to Intel GPU. We could enable Intel GPU with following methods and try the best to keep the original code styles: 1. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 2. enabled XPU for some test path 3. skip some test cases which Intel GPU does not support Pull Request resolved: https://github.com/pytorch/pytorch/pull/159543 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-08-13 12:49:01 +00:00
Han, Chao1	42e51cd4b3	Support ddp zero hook XCCL path (#159240 ) The XCCL backend does not have the https://github.com/pytorch/pytorch/issues/62300 issue, so add an XCCL path here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159240 Approved by: https://github.com/guangyey, https://github.com/Skylion007, https://github.com/EikanWang	2025-08-13 12:37:33 +00:00
Laith Sakka	96bd33b2de	Fix get_free_symbol_uses for several nodes (#160314 ) get_free_symbol_uses is used to determine which unbacked symbols are used by a given node. An incorrectly defined get_free_symbol_uses leads to: - elimination of some nodes because no users are detected (see the added unit test) - incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernel. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the actual view op's base layout and detect the symbols in it, because we use those symbols when we codegen the ComputedBuffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314 Approved by: https://github.com/eellison	2025-08-13 12:28:29 +00:00
Michael Lazos	ecde76c764	[Hierarchical Compile] Sort all regions identically (#158814 ) Previously we topologically sorted each region individually. This works well unless some nodes have no arguments, in which case their order may change. To rectify this, we sort the first region as the reference region and use that sort order to sort the remaining regions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158814 Approved by: https://github.com/williamwen42	2025-08-13 11:55:23 +00:00
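The reference-region idea in the entry above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Dynamo implementation: each region is assumed to be a list whose i-th element corresponds to the i-th element of every other region, and `sort_regions_identically` and `order_key` are hypothetical names, with `order_key` standing in for a topological rank.

```python
from typing import Callable, Hashable, List, Sequence


def sort_regions_identically(
    regions: Sequence[List[Hashable]],
    order_key: Callable[[Hashable], int],
) -> List[List[Hashable]]:
    """Sort the first region, then reuse its permutation for all the others."""
    if not regions:
        return []
    reference = regions[0]
    # Index permutation that sorts the reference region (sorted() is stable for ties).
    perm = sorted(range(len(reference)), key=lambda i: order_key(reference[i]))
    # Apply the same permutation to every region so corresponding nodes stay aligned.
    return [[region[i] for i in perm] for region in regions]


# Hypothetical data: nodes are (name, rank) pairs; rank plays the role of topological order.
region_a = [("mul_a", 2), ("x_a", 0), ("add_a", 1)]
region_b = [("mul_b", 2), ("x_b", 0), ("add_b", 1)]
sorted_regions = sort_regions_identically(
    [region_a, region_b], order_key=lambda node: node[1]
)
```

Because only the reference region is actually sorted, ties between nodes cannot be broken differently from one region to the next.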
Michael Lazos	34ec5ed275	[Dynamo][Hierarchical Compile] Allow parameters to be propagated to submodules (#157979 ) Fixes issue with HF Gen AI models where we mark a param as static and a get_attr node gets put in the region. The effect of this is lifting get_attr nodes to be inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157979 Approved by: https://github.com/williamwen42	2025-08-13 09:12:10 +00:00
PyTorch MergeBot	641ee74781	Revert "Add `label_smoothing` param in `nn.BCELoss` and `nn.BCEWithLogitsLoss` (#150282 )" This reverts commit f990490a23815ea6ee27e487c70ba2cf513ba43d. Reverted https://github.com/pytorch/pytorch/pull/150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150282#issuecomment-3182844949))	2025-08-13 09:01:52 +00:00
Deng, Daisy	6e8865fbc1	port 3 distributed test to Intel GPU and unified some common functions (#158533 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enabled XPU for some test path - Unify some common code under torch/testing/_internal for multiple backend, for example: - requires_nccl_version - _dynamo_dist_per_rank_init - DynamoDistributedSingleProcTestCase - DistTestCases - FSDPTestMultiThread Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-08-13 08:13:23 +00:00
Edward Yang	9a06e6d031	[claude-code] Add top-level module doc for torch/distributed/tensor/_op_schema.py (#157804 ) Not sure how good the description is, seeking insight from maintainers. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157804 Approved by: https://github.com/wanchaol	2025-08-13 07:27:11 +00:00
Erxin Shang	6ea8376f84	Enable XPU for test_autograd_function.py (#160309 ) # Description Fixes #114850, we will port dynamo tests to Intel GPU We could enable Intel GPU with following methods and try the best to keep the original code styles: # Changes 1. Get device type from get_devtype() method. 2. Replace the requires_cuda_and_triton with requires_gpu. 3. Add HAS_XPU_AND_TRITON into the scope. # Notify Pull Request resolved: https://github.com/pytorch/pytorch/pull/160309 Approved by: https://github.com/guangyey, https://github.com/ezyang	2025-08-13 06:38:34 +00:00
FFFrog	8eee08d227	Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK (#160411 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160411 Approved by: https://github.com/ezyang	2025-08-13 06:31:10 +00:00
Masaki Kozuki	e497620260	Add `compile_id: Optional[CompileID]` to `torch._logging._internal.trace_structured_artifact` (#160440 ) Context: When writing a custom `torch.compile` backend, I quite frequently (ab)use `trace_structured_artifact` because I'm too lazy to customize tlparse (ref: `6d8b13c867`). I recently noticed that some of the artifacts I want to store are generated where CompileID cannot be correlated, and the `tlparse` html says > Sometimes, logs are made without a compile id. This makes it difficult to correlate related logs. This stack trie shows all places where log entries occurred without compile context; to fix, look an appropriate place in the stack where compile id should have been specified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160440 Approved by: https://github.com/ezyang	2025-08-13 06:28:23 +00:00
kshitij12345	199e9abb6a	[fx] fix split_module with symint (#160093 ) Fixes https://github.com/pytorch/pytorch/issues/155220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160093 Approved by: https://github.com/ezyang	2025-08-13 05:50:15 +00:00
PyTorch UpdateBot	685f15dbea	[vllm hash update] update the pinned vllm hash (#160484 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160484 Approved by: https://github.com/pytorchbot	2025-08-13 04:54:03 +00:00
Guilherme Leobas	85db508af5	Wrap class definitions in `set_fullgraph(False)` in `test_int`/`bool`/`float`/`complex` (#160276 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160276 Approved by: https://github.com/zou3519 ghstack dependencies: #160216, #160217	2025-08-13 04:53:03 +00:00
Guilherme Leobas	27156ec804	Wrap class definitions in `set_fullgraph(False)` in `test_operator` (#160217 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160217 Approved by: https://github.com/zou3519 ghstack dependencies: #160216	2025-08-13 04:53:03 +00:00
Guilherme Leobas	6746bc59df	Wrap class definitions in `set_fullgraph(False)` in `test_set` (#160216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160216 Approved by: https://github.com/zou3519	2025-08-13 04:53:03 +00:00
Nikita Shulga	3008d985a8	[CD] Do not build pytorch with nvshmem on ARM (#160465 ) The nvshmem binary from 3.3.9 is not compatible with manylinux2_28, and 3.3.20 is not available for download yet. Also, package the nvshmem binary into the full wheel. Fixes https://github.com/pytorch/pytorch/issues/160425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160465 Approved by: https://github.com/atalman, https://github.com/huydhn	2025-08-13 04:10:43 +00:00
PyTorch MergeBot	652a6f5954	Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403 )" This reverts commit 5a9c4cfce42b9eb87da0de40c5633f083115c307. Reverted https://github.com/pytorch/pytorch/pull/160403 on behalf of https://github.com/malfet due to It indeed consistently broken inductor, see `118bc97b14/1` ([comment](https://github.com/pytorch/pytorch/pull/160403#issuecomment-3182101130))	2025-08-13 04:05:46 +00:00
Ankita George	118bc97b14	Write full tensors out at once in HF consolidation script (#159394 ) Not all storage systems support writing at random offsets. This PR changes the consolidation script to write each tensor to a buffer and then write out the buffer, going sequentially through every tensor in the output file. This also helps when the sharded files weren't sharded only in the row-wise dimension. Small writes are expensive, and we were issuing one write per chunk, where a chunk is the largest run of contiguous bytes in the final tensor; for col-wise sharding that could be only a small number of bytes. Now the full tensor is assembled before the write, which reduces the number of small writes. Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394 Approved by: https://github.com/saumishr ghstack dependencies: #159392, #159393	2025-08-13 03:51:16 +00:00
Nikita Shulga	305fa22393	[GHF] Remove `app { name databaseId}` query (#160494 ) From the `PRCheckSuites` fragment, as it causes a security exception when used with the new GITHUB_TOKEN, which looks as follows ``` RuntimeError: GraphQL query fragment PRReviews on PullRequestReviewConnection { nodes { author { login } bodyText createdAt authorAssociation editor { login } databaseId url state } pageInfo { startCursor hasPreviousPage } } fragment PRCheckSuites on CheckSuiteConnection { edges { node { app { name databaseId } workflowRun { workflow { name databaseId } databaseId url } checkRuns(first: 50) { nodes { name conclusion detailsUrl databaseId title summary } pageInfo { endCursor hasNextPage } } conclusion } cursor } pageInfo { hasNextPage } } fragment CommitAuthors on PullRequestCommitConnection { nodes { commit { authors(first: 2) { nodes { user { login } email name } } oid } } pageInfo { endCursor hasNextPage } } query ($owner: String!, $name: String!, $number: Int!) { repository(owner: $owner, name: $name) { pullRequest(number: $number) { closed isCrossRepository author { login } title body headRefName headRepository { nameWithOwner } baseRefName baseRefOid baseRepository { nameWithOwner isPrivate defaultBranchRef { name } } mergeCommit { oid } commits_with_authors: commits(first: 100) { ...CommitAuthors totalCount } commits(last: 1) { nodes { commit { checkSuites(first: 10) { ...PRCheckSuites } status { contexts { context state targetUrl } } oid } } } changedFiles files(first: 100) { nodes { path } pageInfo { endCursor hasNextPage } } reviews(last: 100) { ...PRReviews } comments(last: 5) { nodes { bodyText createdAt author { login } authorAssociation editor { login } databaseId url } pageInfo { startCursor hasPreviousPage } } labels(first: 100) { edges { node { name } } } } } } , args {'name': 'pytorch', 'owner': 'pytorch', 'number': 159820} failed: [{'type': 'FORBIDDEN', 'path': ['repository', 'pullRequest', 'commits', 'nodes', 0, 'commit', 'checkSuites', 
'edges', 4, 'node', 'app'], 'extensions': {'saml_failure': False}, 'locations': [{'line': 26, 'column': 7}], 'message': 'Resource not accessible by integration'}] ``` But the same query works fine if executed using one's Personal Access Token Updated mocks file by running ``` sed -i -e s/a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/ gql_mocks.json sed -i -e s/157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f/28349cb4c891bbf85255fab2c33c770baf77c3e02b29ca9a0e4c6c97bed041db/ gql_mocks.json sed '/"app": {/,+3d' gql_mocks-orig.json >gql_mocks.json sed '/"app": null/d' gql_mocks-orig.json >gql_mocks.json ``` Undisable offending jobs Fixes https://github.com/pytorch/pytorch/issues/159894 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160494 Approved by: https://github.com/huydhn ghstack dependencies: #160490, #160492	2025-08-13 03:46:39 +00:00
Nikita Shulga	1151b40cbf	[BE] Filter unused mocks (#160492 ) Somebody checked in twice the number of mocks into the archive Filter them out by running following script ```python import json with open("gql_mocks-orig.json") as f: mocks = json.load(f) keys = list(mocks.keys()) good_shas = {'a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876', '157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f', '4715ed05b382e572135c049664939f22f9b1249bc0c499ae278d655ad8cb598b', 'a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5', 'e5130469b5373479776bfbccade8039ce4741b97873bb3bec4e279fed08602be', '5dc32efeb8306f03744f6804ef4b500882f2759f7ac17fdc9f123669bfe4805a', '0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98', '8b50878b010492fe64005cc4b4ed34ac5f6695ce093f06b0d8d5403b7787c2c0', '2877b3b1e8630ca4ae797b9d85d5673d25ca8488c01141e11ff55f4a1359fca7'} for k in keys: if any(sha in k for sha in good_shas): continue del mocks[k] with open("gql_mocks.json","w") as f: json.dump(mocks, f, indent=2) f.write("\n") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160492 Approved by: https://github.com/huydhn ghstack dependencies: #160490	2025-08-13 03:46:39 +00:00
Nikita Shulga	d0f9785af3	[CI] Prevent accidental gql_mocks updates by test_trymerge (#160490 ) As they can no longer be fetched from GitHub, see https://github.com/pytorch/pytorch/issues/160489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160490 Approved by: https://github.com/huydhn	2025-08-13 03:46:32 +00:00
Jerry Mannil	ba47821f52	[ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X (#160444 ) * thread_work_size of 16 is giving better perf with many workloads for MI300X cherry-pick of `fb81400d34` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160444 Approved by: https://github.com/jeffdaily	2025-08-13 03:41:25 +00:00
Ankita George	2c5e10a5fc	Add new function consolidate_safetensors_files_on_every_rank for HF consolidation (#159393 ) Currently we are only using rank-0 for HF consolidation. But we should be able to use every rank to consolidate the sharded files, which will speed up the consolidation by Nx (where N is the number of ranks). Adding a new method consolidate_safetensors_files_on_every_rank to do this. Differential Revision: [D79000720](https://our.internmc.facebook.com/intern/diff/D79000720/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159393 Approved by: https://github.com/saumishr ghstack dependencies: #159392	2025-08-13 03:31:36 +00:00
Jane Xu	355462e127	Add stable Tensor get_device_index, use more stable DeviceIndex (#160143 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160143 Approved by: https://github.com/mikaylagawarecki	2025-08-13 03:27:10 +00:00
Xu Han	41673110cd	[inductor] Windows inductor use intel-openmp. (#160258 ) After some debugging, I found that PyTorch's torch_cpu.dll uses intel-openmp rather than MSVC OpenMP, so switch Windows inductor to intel-openmp. This fixes: `c8205cb354/test/inductor/test_aot_inductor.py (L2405-L2408)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160258 Approved by: https://github.com/ezyang	2025-08-13 02:36:19 +00:00
Yu, Guangye	6be6d06295	Avoid potential deadlocks in host allocator (#159352 ) # Motivation This PR fixes a potential deadlock in the host allocator. When calling `event->record(stream)`, the `record_stream` implementation may acquire the Python GIL. In places such as `842cc77ab9/aten/src/ATen/cuda/CachingHostAllocator.cpp (L145-L151)`, and `842cc77ab9/aten/src/ATen/xpu/CachingHostAllocator.cpp (L22-L28)` `record_stream` is invoked while holding the allocator lock. To prevent deadlocks, we must ensure the locking order is: GIL → Allocator Lock. Reversing the order (Allocator Lock → GIL) can cause a deadlock. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159352 Approved by: https://github.com/cyyever, https://github.com/ezyang	2025-08-13 02:30:17 +00:00
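The lock-ordering rule in the entry above (outer lock first, allocator lock second) can be illustrated with a small Python sketch. This is only an analogy under stated assumptions: `outer_lock` stands in for the GIL, `allocator_lock` for the host allocator mutex, and `record_event` and `free_block` are hypothetical functions, not PyTorch APIs.

```python
import threading

outer_lock = threading.Lock()      # stand-in for the GIL in the entry above
allocator_lock = threading.Lock()  # stand-in for the allocator's own mutex
completed = []


def record_event(worker_id: int) -> None:
    # Always take the outer lock first, then the allocator lock.
    with outer_lock:
        with allocator_lock:
            completed.append(worker_id)


def free_block(worker_id: int) -> None:
    # Same order here. Taking allocator_lock first and outer_lock second
    # in only one of these functions is what would allow a deadlock.
    with outer_lock:
        with allocator_lock:
            completed.append(worker_id)


targets = [record_event, free_block] * 4
threads = [
    threading.Thread(target=fn, args=(i,)) for i, fn in enumerate(targets)
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
all_finished = not any(t.is_alive() for t in threads)
```

With a single global acquisition order, no thread can hold the inner lock while waiting on the outer one, so every worker finishes.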
nandesuka	f15ada5c6f	Enable output padding when only outermost dim is dynamic (#159404 ) Summary: When the shape of the output tensor has a dynamic outer most dim, the stride can still be padded to conform to configured alignment if required. Test Plan: CI Rollback Plan: Differential Revision: D79146886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159404 Approved by: https://github.com/blaine-rister, https://github.com/eellison	2025-08-13 01:28:22 +00:00
Nikhil Patel	69a0a9aa7f	[Inductor][Triton] Pass GPUTarget param to updated make_ir function (#160422 ) Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument. Test Plan: `buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel` Rollback Plan: Reviewed By: davidberard98 Differential Revision: D80069909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160422 Approved by: https://github.com/davidberard98, https://github.com/mlazos	2025-08-13 01:27:57 +00:00
Nikita Shulga	32099961d5	[EZ] Delete CircleCI case (#160479 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160479 Approved by: https://github.com/izaitsevfb ghstack dependencies: #160477	2025-08-13 01:19:09 +00:00
Nikita Shulga	8d1cf52922	[EZ][BE] Remove unused `conda-env-macOS-ARM64` (#160477 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160477 Approved by: https://github.com/atalman	2025-08-12 23:41:25 +00:00
fduwjj	b1f43548ca	[c10d] Error out the case when registering symmetric memory without eager init (#160145 ) Instead of implicitly creating nccl comm inside mem pool registration for symmetric memory, we decide to error it out so that we only support eager init case when the nccl comm is already initiated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145 Approved by: https://github.com/kwen2501	2025-08-12 23:25:04 +00:00
Zain Rizvi	0d71ca2c46	[EZ] Replace `pytorch-labs` with `meta-pytorch` (#160459 ) This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch' ## Changes Made - Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch' - Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml - Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files ## Files Modified This PR updates files that contained the target text. Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459 Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet	2025-08-12 22:44:25 +00:00
deedongala	5737372862	[CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882 ) Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners This should help increase available runners even with same number of CI nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-12 22:42:40 +00:00
Isalia20	2e4e5ab4be	[MPS] Add mps keys to `indices` and `values` ops (#160223 ) enable indices and values on sparse mps Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223 Approved by: https://github.com/malfet	2025-08-12 22:08:44 +00:00
Zhengxu Chen	16d15445f8	Fullgraph graph capture with dynamo. (#159749 ) Summary: Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0 We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API. This PR adds a prototype for the API described in the doc. Test Plan: test_misc -- -k test_aot_capture Rollback Plan: Differential Revision: D79534608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159749 Approved by: https://github.com/tugsbayasgalan	2025-08-12 22:06:18 +00:00
henrylhtsang	101276f81b	[BE] Save attributes for CppCompileError for pickling (#160294 ) Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/) Context: When testing the cutlass backend with autotune in a subprocess, I would sometimes see a C++ compilation error (expected) followed by ``` Traceback (most recent call last): File "/torch/_inductor/autotune_process.py", line 175, in get result = TuningProcess.recv(self.read_pipe) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/torch/_inductor/autotune_process.py", line 99, in recv return pickle.load(read_pipe) ^^^^^^^^^^^^^^^^^^^^^^ TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output' ``` which is unexpected. After asking claude, it seems > Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly. > > The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments. > > The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly. > > The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix: Adding these seems to help. 
fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294 Approved by: https://github.com/masnesral	2025-08-12 22:03:36 +00:00
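The failure mode described in the entry above can be reproduced with the standard library alone. The sketch below is illustrative and does not use the real `CppCompileError`: `FragileCompileError` and `PicklableCompileError` are hypothetical classes that mimic an exception with two required constructor arguments, and the `__reduce__` shown is one common way to make such an exception survive a pickle round trip.

```python
import pickle


class FragileCompileError(Exception):
    """Hypothetical exception with two required constructor arguments."""

    def __init__(self, cmd: list, output: str) -> None:
        # Only the formatted message ends up in self.args.
        super().__init__(f"compile failed: {' '.join(cmd)}\n{output}")
        self.cmd = cmd
        self.output = output


class PicklableCompileError(FragileCompileError):
    """Same exception, but it tells pickle how to rebuild itself."""

    def __reduce__(self):
        # Rebuild from the original constructor arguments, not from self.args.
        return (self.__class__, (self.cmd, self.output))


def round_trip(exc: Exception) -> Exception:
    return pickle.loads(pickle.dumps(exc))


try:
    round_trip(FragileCompileError(["cc", "a.cpp"], "no input files"))
    fragile_failed = False
except TypeError:
    # Default unpickling calls the class with self.args, i.e. one argument.
    fragile_failed = True

restored = round_trip(PicklableCompileError(["cc", "a.cpp"], "no input files"))
```

The default exception reduction passes `self.args` back to the constructor, which here holds a single message string, so the second required argument is missing unless `__reduce__` supplies both.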
drisspg	cbffde7745	Factor out the strings to templates for better editor integration (#160357 ) # Summary More code motion. TL;DR: install 'Better Jinja' in VS Code to get syntax highlighting for the templates. (Before/after screenshots omitted.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357 Approved by: https://github.com/eellison	2025-08-12 21:59:54 +00:00
David Berard	78a2fe1d42	[TorchScript] thread-safe ErrorReport::CallStack (#160386 ) Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings. The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called. When this happens, it causes a segfault. This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults. Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults. Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386 Approved by: https://github.com/eellison	2025-08-12 21:59:04 +00:00
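The approach in the entry above, where each guard remembers the stack it pushed onto, can be sketched in Python. This is a simplified analogy and not the TorchScript C++ code: `CallStackGuard`, `_current_stack`, and the explicit `close()` method (standing in for the destructor) are hypothetical names introduced for illustration.

```python
import threading

_local = threading.local()
_stack_lock = threading.Lock()  # guards access to every per-thread stack


def _current_stack() -> list:
    # Lazily create this thread's call stack.
    if not hasattr(_local, "calls"):
        _local.calls = []
    return _local.calls


class CallStackGuard:
    def __init__(self, frame_name: str) -> None:
        # Remember the exact stack we pushed onto, not "whichever stack is current".
        self._stack = _current_stack()
        with _stack_lock:
            self._stack.append(frame_name)

    def close(self) -> None:
        # Safe from any thread: pops from the stack recorded at construction.
        with _stack_lock:
            if self._stack:
                self._stack.pop()


main_stack = _current_stack()
guard = CallStackGuard("outer_fn")
depth_while_open = len(main_stack)

# Release the guard from a different thread than the one that created it.
closer = threading.Thread(target=guard.close)
closer.start()
closer.join()
depth_after_close = len(main_stack)
```

Had `close()` looked up the current thread's stack instead, the second thread would have popped from its own empty stack and left the creator's stack unbalanced.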
Ivan Zaitsev	f8f0414a59	fix cpp builder to avoid missing-source compile error (#160354 ) Summary: the condition ``` if config.is_fbcode() and (not self._aot_mode or self._use_relative_path): sources = [os.path.basename(i) for i in sources] ``` unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths. Fixes internal issue: ``` FAILED (errors=1) CppCompileError: C++ compile error Command: /mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so Output: clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp' clang-17: error: no input files ``` Reviewed By: clee2000 Differential Revision: D80025417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354 Approved by: https://github.com/benjaminglass1, https://github.com/clee2000	2025-08-12 21:36:22 +00:00
Mikayla Gawarecki	4d419a7461	Add pad and narrow to torch/csrc/stable/ops.h (#159328 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159328 Approved by: https://github.com/janeyx99 ghstack dependencies: #159507	2025-08-12 21:29:49 +00:00
Mikayla Gawarecki	655137b678	Update torch::stable::Tensor() default constructor (#159507 ) Allows things like ```cpp Tensor cu_seqlens_q; if (...) { cu_seqlens_q = ... } ... ``` Also adds `torch::stable::Tensor.defined()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159507 Approved by: https://github.com/janeyx99	2025-08-12 21:29:49 +00:00
Gheorghe-Teodor Bercea	f27232a213	[ROCm] Limit number of values per thread for reductions on three dimensions (#159652 ) In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up in the hundreds of thousands for certain tensors. This is bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to reasonable limits, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652 Approved by: https://github.com/jeffdaily	2025-08-12 21:15:56 +00:00
Anshul Sinha	c24ca7f4bf	[FSDP][Collectives] skipping allgather when world size is 1 (#160135 ) Summary: In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all gather comm has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands. https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json Pull Request resolved: https://github.com/pytorch/pytorch/pull/160135 Approved by: https://github.com/weifengpy	2025-08-12 21:13:29 +00:00
AaronWang04	b4596895b9	[DTensor] Registers sharding rule for rms_norm (#159692 ) Reduces collective calls in the forward pass from 2 to 1 In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692 Approved by: https://github.com/tianyu-l	2025-08-12 21:05:24 +00:00
xinan.lin	5a9c4cfce4	[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403 ) Fixes #160243, Fixes #160244, Fixes #160245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403 Approved by: https://github.com/janeyx99	2025-08-12 21:02:44 +00:00
Chien-Lin Chen	a354fa91e2	added class or module info for functions blocked by weight-only load (#159935 ) Fixes #152985 In #152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals. Because the error message didn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration. This fix makes that point clearer. Here's the new error message; the blocked function information follows the warning message after a line break to make it stand out. ``` _pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'> Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. To execute this test, run the following from the base repo dir: python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935 Approved by: https://github.com/mikaylagawarecki	2025-08-12 20:52:25 +00:00
Ankita George	f95b58c284	Remove usage of fsspec in HF consolidation script (#159392 ) Moving towards just supporting local storage to take advantage of HF apis such as safe_open. This was already done in Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usages in consolidation script and relies on local storage only Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392 Approved by: https://github.com/sibuachu	2025-08-12 20:41:06 +00:00
albanD	8e6a313858	Add ownership token when needed on GradientEdge (#160098 ) We can avoid the token by introducing PyObject preservation for THPFunction. But I think it will be too much complexity given that this kind of issue is very rare. Happy to be talked into doing it though if someone really wants to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160098 Approved by: https://github.com/ezyang, https://github.com/soulitzer	2025-08-12 20:14:18 +00:00
Paul de Supinski	7e91394955	Support NUMA Binding for Callable Entrypoints (#160163 ) # Context This is an extension of #149334. # This PR Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`. Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary. Other changes: * Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).) * Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints. # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran ``` $ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt ``` and observed * 6.6% remote memory accesses with 'node' bindings * 11.6% remote without bindings I also ran a similar test with `str` entrypoints as before just to be sure it's still working. 
NOTE: [--run-path triggers the code to be run inside a `Callable`.](`017259f9c6/torch/distributed/run.py (L870)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163 Approved by: https://github.com/d4l3k	2025-08-12 20:08:49 +00:00
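For `str` entrypoints, computing bindings for a single rank at a time can be pictured as prefixing that rank's command with `numactl`. The sketch below is an illustration under stated assumptions, not the torch implementation: `wrap_with_numactl` is a hypothetical name, and choosing `--cpunodebind` alone (no `--preferred`) is an assumption based on the note above that Linux already allocates memory on the local node.

```python
def wrap_with_numactl(command_args, numa_node):
    """Prefix one rank's command with a numactl CPU-node binding.

    Hypothetical helper: the real API named in the commit is
    maybe_wrap_command_with_numa_bindings, whose signature is not shown here.
    """
    if numa_node < 0:
        raise ValueError("numa_node must be non-negative")
    # Only the CPU binding is set; memory placement is left to the kernel's
    # default local-allocation policy.
    return ("numactl", f"--cpunodebind={numa_node}", *command_args)


if __name__ == "__main__":
    print(wrap_with_numactl(("python", "train.py"), numa_node=1))
```

A `Callable` entrypoint has no command line to prefix, which is why the commit describes a separate hack around `Process.start()` for that case.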
Markus Hoehnerbach	89654db1ab	[inductor] fix triton bucketize mask propagation (#159961 ) See `6b414f56a4` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961 Approved by: https://github.com/eellison	2025-08-12 19:59:32 +00:00
Natalia Gimelshein	2d0cdee394	move thread-local capture mode guard to include work.isStarted (#160398 ) Per title, should fix capture errors that happen because nccl watchdog races with capture start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398 Approved by: https://github.com/aorenste	2025-08-12 19:25:04 +00:00
eqy	9903ca4f70	[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140 ) The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN https://github.com/pytorch/pytorch/issues/155225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140 Approved by: https://github.com/ngimel, https://github.com/atalman	2025-08-12 18:07:41 +00:00
PyTorch MergeBot	f341077ce4	Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281 )" This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b. Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](`a7abf57aab`). Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))	2025-08-12 17:57:57 +00:00
Edward Z. Yang	3cec82a7e9	Ensure outer aliasing on DTensor matches inner aliasing (#158954 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954 Approved by: https://github.com/albanD, https://github.com/wconstab	2025-08-12 17:47:48 +00:00
Jerry Mannil	ee9f8ba11d	[ROCm] Use opportunistic fastatomics based on hueristics (#159430 ) * Opportunistic fast atomics works better with small sizes, since there is more chance of lanes doing atomics on the same address Co-author: @amd-hhashemi Reproducer: ``` import time import torch x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float) ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda') src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float) for _ in range(20): x.index_add_(0, ind, src) start_time = time.time() for i in range(100): x.index_add_(0, ind, src) torch.cuda.synchronize() end_time = time.time() mean_time = (end_time - start_time)/100 print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us") ``` Perf numbers: ``` Before: Avg time for index_add_: 25652.16 us After: Avg time for index_add_: 2675.15 us ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-08-12 17:13:54 +00:00
David Berard	1f4057c11a	[inductor] remove no_x_dim (#159810 ) no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional. no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue. However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell. H100 inference benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a H100 training benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a Overall, the benchmarks show minimal change in performance. Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810 Approved by: https://github.com/ngimel, https://github.com/eellison	2025-08-12 17:10:31 +00:00
Jovian Anthony Jaison	94b91a8763	[redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352 ) Summary: Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast. ref: D79456310 (got reverted because of linter) Testing: Refer differential Revision: D79917440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352 Approved by: https://github.com/masnesral	2025-08-12 16:49:08 +00:00
Xinya Zhang	a7abf57aab	[ROCm] Support large inputs for coalesceValuesKernel (#158281 ) # Description `.coalesce` cannot handle large inputs on ROCM due to maximal grid size limit. This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for original `Y` on ROCm to avoid such limitation. Confirmed the new approach can handle large inputs. Correctness needs validation. # Testing Command `python torch_spmv.py 22500000 272500000` ## Script `torch_spmv.py` ``` python import torch import argparse def parse_args(): parser = argparse.ArgumentParser( description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch" ) parser.add_argument("n", type=int, help="Size of the NxN matrix") parser.add_argument("nnz", type=int, help="Number of non-zero entries") return parser.parse_args() def main(): args = parse_args() n = args.n nnz = args.nnz dtype = torch.float32 device = torch.device('cuda') # Generate random indices for the sparse matrix in COO format. torch.manual_seed(42) rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device) cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device) indices = torch.stack([rows, cols], dim=0) # Generate random values. values = torch.randn(nnz, dtype=torch.float32, device=device) # Create the sparse COO matrix and move it to the target device. sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device) sparse_matrix = sparse_matrix.coalesce() # Generate a random dense vector. dense_vector = torch.randn(n, dtype=torch.float32, device=device) # Perform sparse matrix - dense vector multiplication. # Using torch.sparse.mm which expects a 2D tensor for the vector. result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze() # result = torch.mv(sparse_matrix, dense_vector) # Print the result. 
print("Result of the multiplication:") print(torch.sum(result)) if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-08-12 16:42:55 +00:00
PyTorch MergeBot	f7b2f3314c	Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000 )" This reverts commit d0e2240f680ea2a553f7ee8188f52482e130bfd0. Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))	2025-08-12 16:33:02 +00:00
Jeff Daily	9d37c960a4	[ROCm][CI] use new benchmark image for dynamo (#160421 ) Follow-up to #160047 that separated the rocm image into default CI and benchmarks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-12 16:07:19 +00:00
PyTorch MergeBot	b219ca2a00	Revert "Update triton xpu commit to support python 3.14 (#160183 )" This reverts commit 7fbc22855c17741ae016992803b2e147a13aa22d. Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](`7fbc22855c`). Maybe because the docker build changed? Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))	2025-08-12 15:29:19 +00:00
atalman	b7db86600a	Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416 ) Fixes Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding as per this suggestion of Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416 Approved by: https://github.com/malfet	2025-08-12 15:15:12 +00:00
James Wu	9708fcf92d	Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120 ) This PR fixes a bug where user defined triton kernels hidden behind `triton_op` do not register source code changes. If a user only changes a triton kernel source_code, because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet. This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time. This should handle 99% of cases, though. I added an expectedFailure test to show the limitation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120 Approved by: https://github.com/zou3519	2025-08-12 14:11:06 +00:00
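The AST-based approach described in the entry above can be sketched with the standard `ast` module. This is an illustration only, not the AOTAutogradCache code: `referenced_names`, `find_kernels_used`, and the sample source are hypothetical, and the set of known kernel names is assumed to be supplied by the caller.

```python
import ast


def referenced_names(source):
    """Collect every bare name that is read anywhere in ``source``."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


def find_kernels_used(op_source, known_kernels):
    """Return the known kernel names that the custom op's body refers to."""
    return sorted(referenced_names(op_source) & set(known_kernels))


# Hypothetical custom-op source; nothing here is executed, only parsed.
OP_SOURCE = '''
def my_add(x, y):
    out = empty_like(x)
    wrap_triton(add_kernel)[(1,)](x, y, out)
    return out
'''

if __name__ == "__main__":
    print(find_kernels_used(OP_SOURCE, {"add_kernel", "mul_kernel"}))
```

Like the fix it illustrates, this scan only sees names in the op's own body. A kernel reached through another helper function would be missed, which matches the limitation the commit documents.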
Wang, Chuanqi	a288b15ea9	[CI] Reduce XPU Windows build time (#159763 ) Reduce the time cost from 2.5 hours to about 1.5 hours. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-12 14:04:29 +00:00
Wang, Chuanqi	7fbc22855c	Update triton xpu commit to support python 3.14 (#160183 ) Follow PR #159725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-08-12 14:02:36 +00:00
IvanKobzarev	f33ce40bc0	[bucketing] Bucket only adjacent collectives to prevent reordering (#159983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983 Approved by: https://github.com/wconstab, https://github.com/eellison	2025-08-12 11:57:00 +00:00
Animesh Jain	4d5b3f2d5a	[dynamo][guards] Install dict watchers for recrusive dict tag optimization (#159796 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796 Approved by: https://github.com/jansel	2025-08-12 09:49:11 +00:00
zeshengzong	f990490a23	Add `label_smoothing` param in `nn.BCELoss` and `nn.BCEWithLogitsLoss` (#150282 ) Fixes #91545 ## Changes - Add `label_smoothing` param and docs - Add test case for `label_smoothing` - Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss` ## Test Result ```bash pytest -s test/test_nn.py -k test_bce ``` ![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05) ![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964) ![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282 Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki	2025-08-12 09:37:03 +00:00
morrison-turnansky	b9003ed3d8	Dynamo Deep Dive Documentation Fix (#158860 ) changed SourceBuilder to VariableBuilder Fixes #158447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860 Approved by: https://github.com/mlazos	2025-08-12 08:53:33 +00:00
Laith Sakka	fea7e9dd37	extract shape in _view_has_unbacked_input (#160255 ) Summary: We were still getting DDE on reshape. I looked deeper and found an issue in _view_has_unbacked_input: when the input is [[,,]] it needs to be normalized to [..]. Test Plan: existing tests. Rollback Plan: Differential Revision: D79951119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255 Approved by: https://github.com/bobrenjc93	2025-08-12 08:38:19 +00:00
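The normalization mentioned in the entry above (a shape passed as one nested list versus as separate arguments) can be sketched in plain Python. `normalize_shape_args` is a hypothetical name used for illustration; it is not the helper the commit calls.

```python
def normalize_shape_args(*shape):
    """Return a flat tuple for both f(2, 3, 4) and f([2, 3, 4]) call styles."""
    # A single list or tuple argument means the caller passed the whole
    # shape as one object, so unwrap it before inspecting the dimensions.
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        return tuple(shape[0])
    return tuple(shape)


if __name__ == "__main__":
    print(normalize_shape_args([2, 3, 4]))
    print(normalize_shape_args(2, 3, 4))
```

Checks that look at individual dimensions need this unwrapping first. Otherwise they would inspect the nested list itself rather than the sizes inside it.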
Jovian Anthony Jaison	9a0f7a3bb0	[retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348 ) refer: https://github.com/pytorch/pytorch/pull/159655 Earlier pr failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed. Updated test_dynamo_timed + re-ran locally to test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348 Approved by: https://github.com/masnesral	2025-08-12 06:24:54 +00:00
Animesh Jain	01bcf9a40d	Bump transformers pin (#159291 ) Trying to update hf pin. Benchmarking run to figure out issues <img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" /> Retrying - https://github.com/pytorch/pytorch/pull/156118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291 Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-08-12 05:14:17 +00:00
Animesh Jain	8d3d1c8443	[dynamo] fixes to propagate tag safeness (#159807 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807 Approved by: https://github.com/jansel	2025-08-12 04:50:13 +00:00
PyTorch UpdateBot	0f3b10b8ee	[audio hash update] update the pinned audio hash (#160384 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384 Approved by: https://github.com/pytorchbot	2025-08-12 04:38:04 +00:00
Boyuan Feng	5f1010fbb3	[Graph Partition] Pass all OSS unit tests (#154667 ) Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315). Run the same diff on two days and both show speedup on average. [first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d) <img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" /> [second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf) <img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667 Approved by: https://github.com/eellison	2025-08-12 04:37:58 +00:00
Nikita Shulga	edaa151d0d	[CI] Move CUDA tests to trunk workflow (#160379 ) Which is getting run before PR is merged anyway, but 3X less frequently than pull workflow according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b) <img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" /> I.e. that will probably result in a somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379 Approved by: https://github.com/izaitsevfb	2025-08-12 04:23:50 +00:00
rzou	10bc36fe84	Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341 ) Short-term fix for https://github.com/pytorch/pytorch/issues/160333 The problem is: 1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation 2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented. 3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition. The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented return logic into the decomposition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341 Approved by: https://github.com/drisspg	2025-08-12 04:09:37 +00:00
PyTorch UpdateBot	32e5e2f596	[vllm hash update] update the pinned vllm hash (#160259 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259 Approved by: https://github.com/pytorchbot	2025-08-12 04:04:53 +00:00
Scott Todd	bfc873d02e	[ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083 ) This reverts the changes from `b367e5f6a6`. This will also close https://github.com/pytorch/pytorch/pull/158922. Since `30387ab2e4`, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib. There was also a bug in here - if `ROCM_DIR` is unset, the code crashes: ``` File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command cmd_obj.run() File "D:\b\pytorch_main\setup.py", line 853, in run rocm_dir_path = Path(os.environ["ROCM_DIR"]) ~~~~~~~~~~^^^^^^^^^^^^ File "<frozen os>", line 714, in __getitem__ KeyError: 'ROCM_DIR' ``` The code could have checked for `ROCM_PATH` too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083 Approved by: https://github.com/jeffdaily	2025-08-12 02:45:49 +00:00
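The `KeyError: 'ROCM_DIR'` crash in the entry above comes from indexing `os.environ` directly. A minimal sketch of the more tolerant lookup the message hints at (try `ROCM_DIR`, then `ROCM_PATH`) follows. `find_rocm_dir` is a hypothetical helper for illustration; the commit itself removes the code rather than patching it this way.

```python
import os
from pathlib import Path


def find_rocm_dir(environ=None):
    """Return the ROCm directory from the environment, or None if unset."""
    environ = os.environ if environ is None else environ
    # .get() avoids the KeyError that os.environ["ROCM_DIR"] raises when
    # the variable is missing; empty values are treated as unset.
    for name in ("ROCM_DIR", "ROCM_PATH"):
        value = environ.get(name)
        if value:
            return Path(value)
    return None


if __name__ == "__main__":
    print(find_rocm_dir({"ROCM_PATH": "/opt/rocm"}))
    print(find_rocm_dir({}))
```

Passing the mapping as a parameter keeps the lookup testable without mutating the real process environment.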
Scott Todd	eed9dbf70f	[ROCm] Add torch/_rocm_init.py to .gitignore. (#159806 ) Follow-up to https://github.com/pytorch/pytorch/pull/155285. Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like: ```python def initialize(): import rocm_sdk rocm_sdk.initialize_process( preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'], check_version='7.0.0rc20250804') ``` We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves here into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520). This file is then loaded if present here: `a7f3bdf550/torch/__init__.py (L145-L157)` Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806 Approved by: https://github.com/jeffdaily	2025-08-12 02:24:21 +00:00
Natalia Gimelshein	be53f609aa	fix retaining multimem in symmetric memory (#160343 ) fixes OOM in #160289 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343 Approved by: https://github.com/eqy	2025-08-12 02:03:20 +00:00
Zain Rizvi	95210cc409	[BE] Isolate pre-push hook dependencies in dedicated virtual environment (#160048 ) This adds two changes: - Isolates pre-push hook dependencies into an isolated venv, no longer affect your system environment - Lets you manually run the pre-push lintrunner (including with lintrunner -a) by invoking `python scripts/lintrunner.py [-a]` (it's ugly, but better than nothing...for now) This is a follow up to: - https://github.com/pytorch/pytorch/pull/158389 ## Problem The current pre-push hook setup installs lintrunner and related dependencies globally, which makes developers nervous about system pollution and can cause version conflicts with existing installations. Also, if the pre-push lintrunner found errors, you had to hope your normal lintrunner could fix them (which wasn't always the case, e.g. if those errors only manifested in certain python versions) ## Key Changes: - Isolated Environment: Creates .git/hooks/linter/.venv/ with Python 3.9 (the python used in CI) and an isolated lintrunner installation - User-Friendly CLI: New python scripts/lintrunner.py wrapper allows developers to run lintrunner (including -a auto-fix) from any environment - Simplified Architecture: Eliminates pre-commit dependency entirely - uses direct git hooks File Changes: - scripts/setup_hooks.py: Rewritten to create isolated uv-managed virtual environment - scripts/lintrunner.py: New wrapper script with shared hash management logic - scripts/run_lintrunner.py: Removed (functionality merged into lintrunner.py) - .pre-commit-config.yaml: Removed (no longer needed) ## Usage: ``` # Setup (run once) python scripts/setup_hooks.py # Manual linting (works from any environment) python scripts/lintrunner.py # Check mode python scripts/lintrunner.py -a # Auto-fix mode # Git hooks work automatically git push # Runs lintrunner in isolated environment # Need to skip the pre-push hook? 
git push --no-verify ``` ## Benefits: - ✅ Zero global dependency installation - ✅ Per-repository isolation prevents version conflicts - ✅ Full lintrunner functionality is now accessible ## Implementation Notes: - Virtual env is kept in a dedicated dir in .git, to keep per-repo mechanics - lintrunner.py does not need to be invoked from a specific venv. It'll invoke the right venv itself. A minor bug: It tends to garble the lintrunner output a bit, like the screenshot below shows, but I haven't found a workaround so far and it remains understandable to users: <img width="241" height="154" alt="image" src="https://github.com/user-attachments/assets/9496f925-8524-4434-8486-dc579442d688" /> ## What's next? Features that could be added: - Check for lintrunner updates, auto-update if needed - Depending on dev response, this could be enabled by default for all pytorch/pytorch environments Pull Request resolved: https://github.com/pytorch/pytorch/pull/160048 Approved by: https://github.com/seemethere	2025-08-12 01:58:46 +00:00
Ramya Ramineni	7a974a88f2	[ROCm] Fix resource_strings.h (#159996 ) This PR fixes errors like the one below: ``` [rank7]: RuntimeError: /tmp/comgr-c3c81b/input/CompileSourceejOPx6:34:8: error: unknown type name 'uint64_t'; did you mean '__hip_internal::uint64_t'? [rank7]: 34 | if(((uint64_t) t0.data) % (4 * sizeof(half)) != 0) flag_vec4 = false; ``` The following datatypes need to be defined in `torch/csrc/jit/codegen/fuser/cuda/resource_strings.h` for ROCm versions >= 7.0. ``` typedef unsigned char uint8_t; typedef signed char int8_t; typedef short int int16_t; typedef long long int int64_t; typedef unsigned long long int uint64_t; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159996 Approved by: https://github.com/pruthvistony, https://github.com/Skylion007, https://github.com/jeffdaily	2025-08-12 01:58:02 +00:00
henrylhtsang	f3f159ff8c	[BE][cutlass backend] Reduce severity of log message for no cutlass config found (#160148 ) This is not really a problem. Sometimes we cannot find a cutlass config due to shape, e.g. when k is odd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160148 Approved by: https://github.com/mlazos, https://github.com/Skylion007	2025-08-12 01:41:58 +00:00
henrylhtsang	b90feeac86	[BE][cutlass backend] Fix subproc addmm tests (#160295 ) Differential Revision: [D79977421](https://our.internmc.facebook.com/intern/diff/D79977421/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160295 Approved by: https://github.com/jingsh	2025-08-12 01:41:06 +00:00
Han, Xu	0d40ff3b49	[inductor] fix test_different_file_paths_local_pgo on Windows. (#160382 ) fix test_different_file_paths_local_pgo on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160382 Approved by: https://github.com/angelayi	2025-08-12 01:35:39 +00:00
Scott Todd	cae2b5e3d2	[ROCm][Windows] Enable USE_ROCM, disable USE_RCCL on Windows. (#159079 ) This allows setting `USE_ROCM` on Windows. A few other patches are still required to build (see https://github.com/ROCm/TheRock/issues/589), but we have instructions using open source code and rocm python packages available at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#build-pytorch-with-rocm-support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159079 Approved by: https://github.com/jeffdaily	2025-08-12 01:28:20 +00:00
Scott Todd	ee89cc7a0a	[ROCm][Windows] Fix LoadHIP handling of environment variable paths on Windows. (#159080 ) See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`. This fixes configure errors like: ``` CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set): Syntax error in cmake code at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 when parsing string D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules Invalid character escape '\p'. CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile): Failed to configure test project build system. ``` (note the mixed usage of `\` and `/` in that string) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080 Approved by: https://github.com/jeffdaily	2025-08-12 00:18:19 +00:00
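The configure error in the entry above comes from a backslash-separated environment path being parsed as a CMake string, where `\p` is read as an escape. CMake's linked documentation covers its own conversion commands. The Python sketch below only illustrates the same normalization idea with `pathlib`; it is not the change made in LoadHIP, and `to_forward_slashes` is a hypothetical helper.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path_text):
    """Rewrite a Windows-style path so it uses '/' separators throughout."""
    # PureWindowsPath accepts both '\\' and '/' as separators, so mixed
    # input such as the string in the error message is handled as well.
    return PureWindowsPath(path_text).as_posix()


if __name__ == "__main__":
    mixed = r"D:\projects\TheRock\.venv\Lib\site-packages\_rocm_sdk_devel/cmake"
    print(to_forward_slashes(mixed))
```

`PureWindowsPath` performs no filesystem access, so the sketch behaves the same on Linux and Windows hosts.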
Howard Huang	e63c2b21c1	[PP] Initialize P2P communicators on first step (#160210 ) Was hitting hangs in multi-node settings and initializing the NCCL communicators needed for batch p2p ops ahead of time fixes this. This change adds extra communication since it communicates a dummy tensor to next and previous stage ranks. However, this is only paid on the first step so it is negligible. Debug history: https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160210 Approved by: https://github.com/wconstab	2025-08-11 23:46:58 +00:00
drisspg	3626ba711b	[FlexAttention] Swap from and to & for new triton (#160227 ) Fixes #158463 On B200 I am getting a bunch of error spew: ```Shell /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` Triton compilation failed: triton_tem_fused_zeros_1 def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0): PRESCALE_QK : tl.constexpr = False ``` ```Shell 74 = arith.subi %170, %166 : i32 %175 = arith.muli %174, %c128_i32 : i32 %176 = arith.subi %175, %c64_i32 : i32 %177 = arith.extui %173 : i1 to i32 %178 = arith.muli %176, %177 : i32 %179 = arith.subi %c1_i32, %177 : i32 %180 = arith.muli %179, %c64_i32 : i32 %181 = arith.addi %178, %180 : i32 %182 = arith.muli %181, %c64_i32 : i32 %183 = tt.splat %182 : i32 -> tensor<64x64xi32> %184 = tt.addptr %arg19, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %185 = tt.addptr %arg20, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %186 = tt.splat %181 : i32 -> tensor<64xi32> %187 = arith.addi %arg21, %186 : tensor<64xi32> scf.yield %163, %184, %185, %187 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32> %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1> %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : 
tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32> %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32> %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %122 = arith.select %115, %cst_4, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1> %123 = tt.broadcast %122 : tensor<1x64xi1> -> tensor<64x64xi1> %124 = arith.select %123, %121, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %125 = arith.mulf %124, %cst_2 : tensor<64x64xf32> %126 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %127 = arith.subf %125, %126 : tensor<64x64xf32> %128 = math.exp2 %127 : tensor<64x64xf32> %129 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %130 = tt.dot %51, %129, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %131 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> %132 = tt.broadcast %131 : tensor<64x1xf32> -> tensor<64x64xf32> %133 = arith.subf %130, %132 : tensor<64x64xf32> %134 = arith.mulf %128, %133 : tensor<64x64xf32> %135 = arith.mulf %134, %cst_3 : tensor<64x64xf32> %136 = arith.select %116, %135, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %137 = arith.select %115, %122, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1> %138 = tt.broadcast %137 : tensor<1x64xi1> -> tensor<64x64xi1> %139 = arith.select %138, %136, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %140 = arith.truncf %139 : tensor<64x64xf32> to tensor<64x64xf16> %141 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %142 = tt.dot %140, %141, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %142 : tensor<64x64xf32> } else { scf.yield %cst_9 : tensor<64x64xf32> } %84 = tt.addptr %arg13, %22 : !tt.ptr<i32>, i32 %85 = tt.load %84 : !tt.ptr<i32> %86 = arith.muli %85, %c128_i32 : i32 %87 = tt.addptr %arg12, %21 : !tt.ptr<i32>, i32 %88 = tt.load %87 : !tt.ptr<i32> %89 = tt.splat 
%86 : i32 -> tensor<64xi32> %90 = arith.addi %89, %14 : tensor<64xi32> %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %92 = arith.muli %91, %cst_11 : tensor<1x64xi32> %93 = tt.addptr %71, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %94 = tt.broadcast %93 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %95 = tt.addptr %94, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %96 = tt.addptr %76, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %97 = tt.broadcast %96 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %98 = tt.addptr %97, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %99 = arith.muli %88, %c2_i32 : i32 %100 = arith.minsi %99, %c4_i32 : i32 %101 = arith.cmpi sge, %100, %c1_i32 : i32 %102 = scf.if %101 -> (tensor<64x64xf32>) { %112 = arith.subi %100, %c1_i32 : i32 %113:4 = scf.for %arg17 = %c0_i32 to %112 step %c1_i32 iter_args(%arg18 = %83, %arg19 = %95, %arg20 = %98, %arg21 = %90) -> (tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %137 = tt.expand_dims %arg21 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %138 = arith.cmpi slt, %137, %cst_7 : tensor<1x64xi32> %139 = tt.broadcast %138 : tensor<1x64xi1> -> tensor<64x64xi1> %140 = tt.load %arg19, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>> %141 = tt.dot %46, %140, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %142 = arith.mulf %141, %cst_13 : tensor<64x64xf32> %143 = arith.mulf %142, %cst_3 : tensor<64x64xf32> %144 = arith.mulf %143, %cst_2 : tensor<64x64xf32> %145 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %146 = arith.subf %144, %145 : tensor<64x64xf32> %147 = math.exp2 %146 : tensor<64x64xf32> %148 = tt.load %arg20, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>> %149 = tt.dot %51, %148, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %150 = tt.expand_dims %55 {axis = 1 : i32} : 
tensor<64xf32> -> tensor<64x1xf32> %151 = tt.broadcast %150 : tensor<64x1xf32> -> tensor<64x64xf32> %152 = arith.subf %149, %151 : tensor<64x64xf32> %153 = arith.mulf %147, %152 : tensor<64x64xf32> %154 = arith.mulf %153, %cst_3 : tensor<64x64xf32> %155 = arith.truncf %154 : tensor<64x64xf32> to tensor<64x64xf16> %156 = tt.trans %140 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %157 = tt.dot %155, %156, %arg18, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %158 = arith.divsi %arg17, %c2_i32 : i32 %159 = tt.addptr %84, %158 : !tt.ptr<i32>, i32 %160 = tt.load %159 evictionPolicy = evict_last : !tt.ptr<i32> %161 = arith.addi %158, %c1_i32 : i32 %162 = arith.cmpi slt, %161, %88 : i32 %163 = tt.addptr %159, %c1_i32 : !tt.ptr<i32>, i32 %164 = tt.load %163, %162 evictionPolicy = evict_last : !tt.ptr<i32> %165 = arith.addi %arg17, %c1_i32 : i32 %166 = arith.remsi %165, %c2_i32 : i32 %167 = arith.cmpi eq, %166, %c0_i32 : i32 %168 = arith.subi %164, %160 : i32 %169 = arith.muli %168, %c128_i32 : i32 %170 = arith.subi %169, %c64_i32 : i32 %171 = arith.extui %167 : i1 to i32 %172 = arith.muli %170, %171 : i32 %173 = arith.subi %c1_i32, %171 : i32 %174 = arith.muli %173, %c64_i32 : i32 %175 = arith.addi %172, %174 : i32 %176 = arith.muli %175, %c64_i32 : i32 %177 = tt.splat %176 : i32 -> tensor<64x64xi32> %178 = tt.addptr %arg19, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %179 = tt.addptr %arg20, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %180 = tt.splat %175 : i32 -> tensor<64xi32> %181 = arith.addi %arg21, %180 : tensor<64xi32> scf.yield %157, %178, %179, %181 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32> %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1> %117 = tt.load %113#1, %116, %cst_8 : 
tensor<64x64x!tt.ptr<f16>> %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32> %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32> %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %122 = arith.mulf %121, %cst_2 : tensor<64x64xf32> %123 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32> %124 = arith.subf %122, %123 : tensor<64x64xf32> %125 = math.exp2 %124 : tensor<64x64xf32> %126 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>> %127 = tt.dot %51, %126, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %128 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> %129 = tt.broadcast %128 : tensor<64x1xf32> -> tensor<64x64xf32> %130 = arith.subf %127, %129 : tensor<64x64xf32> %131 = arith.mulf %125, %130 : tensor<64x64xf32> %132 = arith.mulf %131, %cst_3 : tensor<64x64xf32> %133 = arith.select %116, %132, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %134 = arith.truncf %133 : tensor<64x64xf32> to tensor<64x64xf16> %135 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %136 = tt.dot %134, %135, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %136 : tensor<64x64xf32> } else { scf.yield %83 : tensor<64x64xf32> } %103 = tt.splat %33 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %104 = tt.addptr %103, %37 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %105 = tt.broadcast %104 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %106 = tt.addptr %105, %42 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %107 = arith.mulf %102, %cst_13 : tensor<64x64xf32> %108 = arith.cmpi slt, %40, %cst_11 : tensor<1x64xi32> %109 = tt.broadcast %108 : tensor<1x64xi1> -> tensor<64x64xi1> %110 = arith.andi %45, %109 : tensor<64x64xi1> %111 = arith.truncf %107 : tensor<64x64xf32> to 
tensor<64x64xf16> tt.store %106, %111, %110 : tensor<64x64x!tt.ptr<f16>> } else { %16 = arith.divsi %0, %c2_i32 : i32 %17 = arith.muli %0, %c64_i32 : i32 %18 = tt.splat %17 : i32 -> tensor<64xi32> %19 = arith.addi %18, %14 : tensor<64xi32> %20 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %21 = arith.muli %20, %cst_14 : tensor<64x1xi32> %22 = tt.splat %11 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %23 = tt.addptr %22, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %24 = tt.expand_dims %14 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %25 = tt.broadcast %23 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %26 = tt.broadcast %24 : tensor<1x64xi32> -> tensor<64x64xi32> %27 = tt.addptr %25, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %28 = arith.cmpi slt, %20, %cst_10 : tensor<64x1xi32> %29 = tt.broadcast %28 : tensor<64x1xi1> -> tensor<64x64xi1> %30 = tt.load %27, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>> %31 = tt.splat %12 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %32 = tt.addptr %31, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %33 = tt.broadcast %32 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %34 = tt.addptr %33, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %35 = tt.load %34, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>> %36:2 = scf.for %arg17 = %c0_i32 to %c4_i32 step %c1_i32 iter_args(%arg18 = %cst_9, %arg19 = %cst_9) -> (tensor<64x64xf32>, tensor<64x64xf32>) : i32 { %55 = arith.muli %2, %c4_i32 : i32 %56 = arith.addi %55, %arg17 : i32 %57 = arith.muli %56, %c2048_i32 : i32 %58 = arith.muli %1, %c32768_i32 : i32 %59 = arith.addi %57, %58 : i32 %60 = arith.extsi %59 : i32 to i64 %61 = arith.muli %1, %c16_i32 : i32 %62 = arith.addi %61, %56 : i32 %63 = arith.muli %62, %c32_i32 : i32 %64 = arith.extsi %63 : i32 to i64 %65 = tt.addptr %arg0, %60 : !tt.ptr<f16>, i64 %66 = tt.addptr %arg5, %60 : !tt.ptr<f16>, i64 %67 = tt.addptr %arg3, %64 : !tt.ptr<f32>, i64 %68 = tt.addptr %arg4, %64 : 
!tt.ptr<f32>, i64 %69 = arith.remsi %56, %c16_i32 : i32 %70 = arith.muli %3, %c16_i32 : i32 %71 = arith.addi %70, %69 : i32 %72 = arith.muli %71, %c2_i32 : i32 %73 = arith.addi %72, %16 : i32 %74 = tt.addptr %arg11, %73 : !tt.ptr<i32>, i32 %75 = tt.load %74 : !tt.ptr<i32> %76 = arith.muli %75, %c128_i32 : i32 %77 = tt.addptr %arg10, %73 : !tt.ptr<i32>, i32 %78 = tt.load %77 : !tt.ptr<i32> %79 = tt.splat %76 : i32 -> tensor<64xi32> %80 = arith.addi %79, %14 : tensor<64xi32> %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %82 = arith.muli %81, %cst_11 : tensor<1x64xi32> %83 = tt.splat %65 : !tt.ptr<f16> -> tensor<1x64x!tt.ptr<f16>> %84 = tt.addptr %83, %82 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %85 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %86 = tt.broadcast %84 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %87 = tt.broadcast %85 : tensor<64x1xi32> -> tensor<64x64xi32> %88 = tt.addptr %86, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %89 = tt.expand_dims %80 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %90 = arith.muli %89, %cst_14 : tensor<64x1xi32> %91 = tt.splat %66 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %92 = tt.addptr %91, %90 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %93 = tt.broadcast %92 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %94 = tt.addptr %93, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %95 = arith.muli %78, %c2_i32 : i32 %96 = arith.minsi %95, %c1_i32 : i32 %97 = arith.cmpi sge, %96, %c1_i32 : i32 %98:2 = scf.if %97 -> (tensor<64x64xf32>, tensor<64x64xf32>) { %120 = arith.subi %96, %c1_i32 : i32 %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %arg18, %arg22 = %arg19, %arg23 = %88, %arg24 = %94, %arg25 = %80) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %167 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> 
tensor<1x64xi32> %168 = arith.cmpi slt, %167, %cst_1 : tensor<1x64xi32> %169 = tt.broadcast %168 : tensor<1x64xi1> -> tensor<64x64xi1> %170 = tt.load %arg23, %169, %cst_8 : tensor<64x64x!tt.ptr<f16>> %171 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32> %172 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %173 = tt.addptr %172, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %174 = tt.load %173, %171 : tensor<64x!tt.ptr<f32>> %175 = arith.cmpf oeq, %174, %cst_16 : tensor<64xf32> %176 = arith.select %175, %cst_15, %174 : tensor<64xi1>, tensor<64xf32> %177 = tt.dot %30, %170, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %178 = arith.mulf %177, %cst_13 : tensor<64x64xf32> %179 = arith.mulf %178, %cst_3 : tensor<64x64xf32> %180 = arith.mulf %179, %cst_2 : tensor<64x64xf32> %181 = tt.expand_dims %176 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %182 = tt.broadcast %181 : tensor<1x64xf32> -> tensor<64x64xf32> %183 = arith.subf %180, %182 : tensor<64x64xf32> %184 = math.exp2 %183 : tensor<64x64xf32> %185 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %186 = arith.cmpi slt, %185, %cst_12 : tensor<64x1xi32> %187 = tt.broadcast %186 : tensor<64x1xi1> -> tensor<64x64xi1> %188 = tt.load %arg24, %187, %cst_8 : tensor<64x64x!tt.ptr<f16>> %189 = arith.truncf %184 : tensor<64x64xf32> to tensor<64x64xf16> %190 = tt.dot %189, %188, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %191 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %192 = tt.addptr %191, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %193 = tt.load %192, %171 : tensor<64x!tt.ptr<f32>> %194 = tt.trans %188 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %195 = tt.dot %35, %194, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %196 = tt.expand_dims %193 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> 
%197 = tt.broadcast %196 : tensor<1x64xf32> -> tensor<64x64xf32> %198 = arith.subf %195, %197 : tensor<64x64xf32> %199 = arith.mulf %184, %198 : tensor<64x64xf32> %200 = arith.mulf %199, %cst_3 : tensor<64x64xf32> %201 = arith.truncf %200 : tensor<64x64xf32> to tensor<64x64xf16> %202 = tt.trans %170 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %203 = tt.dot %201, %202, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %204 = arith.divsi %arg20, %c2_i32 : i32 %205 = tt.addptr %74, %204 : !tt.ptr<i32>, i32 %206 = tt.load %205 evictionPolicy = evict_last : !tt.ptr<i32> %207 = arith.addi %204, %c1_i32 : i32 %208 = arith.cmpi slt, %207, %78 : i32 %209 = tt.addptr %205, %c1_i32 : !tt.ptr<i32>, i32 %210 = tt.load %209, %208 evictionPolicy = evict_last : !tt.ptr<i32> %211 = arith.addi %arg20, %c1_i32 : i32 %212 = arith.remsi %211, %c2_i32 : i32 %213 = arith.cmpi eq, %212, %c0_i32 : i32 %214 = arith.subi %210, %206 : i32 %215 = arith.muli %214, %c128_i32 : i32 %216 = arith.subi %215, %c64_i32 : i32 %217 = arith.extui %213 : i1 to i32 %218 = arith.muli %216, %217 : i32 %219 = arith.subi %c1_i32, %217 : i32 %220 = arith.muli %219, %c64_i32 : i32 %221 = arith.addi %218, %220 : i32 %222 = arith.muli %221, %c64_i32 : i32 %223 = tt.splat %222 : i32 -> tensor<64x64xi32> %224 = tt.addptr %arg23, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %225 = tt.addptr %arg24, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %226 = tt.splat %221 : i32 -> tensor<64xi32> %227 = arith.addi %arg25, %226 : tensor<64xi32> scf.yield %203, %190, %224, %225, %227 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32> %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1> %125 = tt.load %121#2, %124, %cst_8 : 
tensor<64x64x!tt.ptr<f16>> %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32> %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>> %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32> %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32> %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32> %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32> %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %136 = arith.select %28, %cst, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1> %137 = tt.broadcast %136 : tensor<64x1xi1> -> tensor<64x64xi1> %138 = arith.select %137, %135, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %139 = arith.mulf %138, %cst_2 : tensor<64x64xf32> %140 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %141 = tt.broadcast %140 : tensor<1x64xf32> -> tensor<64x64xf32> %142 = arith.subf %139, %141 : tensor<64x64xf32> %143 = math.exp2 %142 : tensor<64x64xf32> %144 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %145 = arith.cmpi slt, %144, %cst_12 : tensor<64x1xi32> %146 = tt.broadcast %145 : tensor<64x1xi1> -> tensor<64x64xi1> %147 = tt.load %121#3, %146, %cst_8 : tensor<64x64x!tt.ptr<f16>> %148 = arith.truncf %143 : tensor<64x64xf32> to tensor<64x64xf16> %149 = tt.dot %148, %147, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %150 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %151 = tt.addptr %150, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %152 = tt.load %151, %126 : tensor<64x!tt.ptr<f32>> %153 = tt.trans %147 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %154 = tt.dot %35, %153, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * 
tensor<64x64xf16> -> tensor<64x64xf32> %155 = tt.expand_dims %152 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %156 = tt.broadcast %155 : tensor<1x64xf32> -> tensor<64x64xf32> %157 = arith.subf %154, %156 : tensor<64x64xf32> %158 = arith.mulf %143, %157 : tensor<64x64xf32> %159 = arith.mulf %158, %cst_3 : tensor<64x64xf32> %160 = arith.select %29, %159, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %161 = arith.select %28, %136, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1> %162 = tt.broadcast %161 : tensor<64x1xi1> -> tensor<64x64xi1> %163 = arith.select %162, %160, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %164 = arith.truncf %163 : tensor<64x64xf32> to tensor<64x64xf16> %165 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %166 = tt.dot %164, %165, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %166, %149 : tensor<64x64xf32>, tensor<64x64xf32> } else { scf.yield %arg18, %arg19 : tensor<64x64xf32>, tensor<64x64xf32> } %99 = tt.addptr %arg15, %73 : !tt.ptr<i32>, i32 %100 = tt.load %99 : !tt.ptr<i32> %101 = arith.muli %100, %c128_i32 : i32 %102 = tt.addptr %arg14, %73 : !tt.ptr<i32>, i32 %103 = tt.load %102 : !tt.ptr<i32> %104 = tt.splat %101 : i32 -> tensor<64xi32> %105 = arith.addi %104, %14 : tensor<64xi32> %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %107 = arith.muli %106, %cst_11 : tensor<1x64xi32> %108 = tt.addptr %83, %107 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32> %109 = tt.broadcast %108 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %110 = tt.addptr %109, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %111 = tt.expand_dims %105 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %112 = arith.muli %111, %cst_14 : tensor<64x1xi32> %113 = tt.addptr %91, %112 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %114 = tt.broadcast %113 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %115 = tt.addptr %114, 
%26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %116 = arith.muli %103, %c2_i32 : i32 %117 = arith.minsi %116, %c1_i32 : i32 %118 = arith.cmpi sge, %117, %c1_i32 : i32 %119:2 = scf.if %118 -> (tensor<64x64xf32>, tensor<64x64xf32>) { %120 = arith.subi %117, %c1_i32 : i32 %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %98#0, %arg22 = %98#1, %arg23 = %110, %arg24 = %115, %arg25 = %105) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>) : i32 { %161 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %162 = arith.cmpi slt, %161, %cst_1 : tensor<1x64xi32> %163 = tt.broadcast %162 : tensor<1x64xi1> -> tensor<64x64xi1> %164 = tt.load %arg23, %163, %cst_8 : tensor<64x64x!tt.ptr<f16>> %165 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32> %166 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %167 = tt.addptr %166, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %168 = tt.load %167, %165 : tensor<64x!tt.ptr<f32>> %169 = arith.cmpf oeq, %168, %cst_16 : tensor<64xf32> %170 = arith.select %169, %cst_15, %168 : tensor<64xi1>, tensor<64xf32> %171 = tt.dot %30, %164, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %172 = arith.mulf %171, %cst_13 : tensor<64x64xf32> %173 = arith.mulf %172, %cst_3 : tensor<64x64xf32> %174 = arith.mulf %173, %cst_2 : tensor<64x64xf32> %175 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %176 = tt.broadcast %175 : tensor<1x64xf32> -> tensor<64x64xf32> %177 = arith.subf %174, %176 : tensor<64x64xf32> %178 = math.exp2 %177 : tensor<64x64xf32> %179 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %180 = arith.cmpi slt, %179, %cst_12 : tensor<64x1xi32> %181 = tt.broadcast %180 : tensor<64x1xi1> -> tensor<64x64xi1> %182 = tt.load %arg24, %181, %cst_8 : tensor<64x64x!tt.ptr<f16>> %183 = arith.truncf %178 : tensor<64x64xf32> to 
tensor<64x64xf16> %184 = tt.dot %183, %182, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %185 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %186 = tt.addptr %185, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %187 = tt.load %186, %165 : tensor<64x!tt.ptr<f32>> %188 = tt.trans %182 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %189 = tt.dot %35, %188, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %190 = tt.expand_dims %187 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %191 = tt.broadcast %190 : tensor<1x64xf32> -> tensor<64x64xf32> %192 = arith.subf %189, %191 : tensor<64x64xf32> %193 = arith.mulf %178, %192 : tensor<64x64xf32> %194 = arith.mulf %193, %cst_3 : tensor<64x64xf32> %195 = arith.truncf %194 : tensor<64x64xf32> to tensor<64x64xf16> %196 = tt.trans %164 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %197 = tt.dot %195, %196, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %198 = arith.divsi %arg20, %c2_i32 : i32 %199 = tt.addptr %99, %198 : !tt.ptr<i32>, i32 %200 = tt.load %199 evictionPolicy = evict_last : !tt.ptr<i32> %201 = arith.addi %198, %c1_i32 : i32 %202 = arith.cmpi slt, %201, %103 : i32 %203 = tt.addptr %199, %c1_i32 : !tt.ptr<i32>, i32 %204 = tt.load %203, %202 evictionPolicy = evict_last : !tt.ptr<i32> %205 = arith.addi %arg20, %c1_i32 : i32 %206 = arith.remsi %205, %c2_i32 : i32 %207 = arith.cmpi eq, %206, %c0_i32 : i32 %208 = arith.subi %204, %200 : i32 %209 = arith.muli %208, %c128_i32 : i32 %210 = arith.subi %209, %c64_i32 : i32 %211 = arith.extui %207 : i1 to i32 %212 = arith.muli %210, %211 : i32 %213 = arith.subi %c1_i32, %211 : i32 %214 = arith.muli %213, %c64_i32 : i32 %215 = arith.addi %212, %214 : i32 %216 = arith.muli %215, %c64_i32 : i32 %217 = tt.splat %216 : i32 -> tensor<64x64xi32> %218 = tt.addptr %arg23, %217 : 
tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %219 = tt.addptr %arg24, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %220 = tt.splat %215 : i32 -> tensor<64xi32> %221 = arith.addi %arg25, %220 : tensor<64xi32> scf.yield %197, %184, %218, %219, %221 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32> } %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32> %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1> %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>> %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32> %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>> %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32> %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32> %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32> %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32> %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32> %136 = arith.mulf %135, %cst_2 : tensor<64x64xf32> %137 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %138 = tt.broadcast %137 : tensor<1x64xf32> -> tensor<64x64xf32> %139 = arith.subf %136, %138 : tensor<64x64xf32> %140 = math.exp2 %139 : tensor<64x64xf32> %141 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> %142 = arith.cmpi slt, %141, %cst_12 : tensor<64x1xi32> %143 = tt.broadcast %142 : tensor<64x1xi1> -> tensor<64x64xi1> %144 = tt.load %121#3, %143, %cst_8 : tensor<64x64x!tt.ptr<f16>> %145 = arith.truncf %140 : tensor<64x64xf32> to tensor<64x64xf16> %146 = tt.dot %145, %144, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * 
tensor<64x64xf16> -> tensor<64x64xf32> %147 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>> %148 = tt.addptr %147, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32> %149 = tt.load %148, %126 : tensor<64x!tt.ptr<f32>> %150 = tt.trans %144 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %151 = tt.dot %35, %150, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> %152 = tt.expand_dims %149 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32> %153 = tt.broadcast %152 : tensor<1x64xf32> -> tensor<64x64xf32> %154 = arith.subf %151, %153 : tensor<64x64xf32> %155 = arith.mulf %140, %154 : tensor<64x64xf32> %156 = arith.mulf %155, %cst_3 : tensor<64x64xf32> %157 = arith.select %29, %156, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32> %158 = arith.truncf %157 : tensor<64x64xf32> to tensor<64x64xf16> %159 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16> %160 = tt.dot %158, %159, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32> scf.yield %160, %146 : tensor<64x64xf32>, tensor<64x64xf32> } else { scf.yield %98#0, %98#1 : tensor<64x64xf32>, tensor<64x64xf32> } scf.yield %119#0, %119#1 : tensor<64x64xf32>, tensor<64x64xf32> } %37 = tt.splat %13 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>> %38 = tt.addptr %37, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32> %39 = tt.broadcast %38 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>> %40 = tt.addptr %39, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %41 = arith.cmpi slt, %24, %cst_11 : tensor<1x64xi32> %42 = tt.broadcast %41 : tensor<1x64xi1> -> tensor<64x64xi1> %43 = arith.andi %29, %42 : tensor<64x64xi1> %44 = arith.truncf %36#1 : tensor<64x64xf32> to tensor<64x64xf16> tt.store %40, %44, %43 : tensor<64x64x!tt.ptr<f16>> %45 = arith.mulf %36#0, %cst_13 : tensor<64x64xf32> %46 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x64xi32> %47 = arith.addi %26, %46 : tensor<64x64xi32> 
%48 = tt.splat %4 : i32 -> tensor<64x64xi32> %49 = arith.addi %47, %48 : tensor<64x64xi32> %50 = tt.splat %8 : i32 -> tensor<64x64xi32> %51 = arith.addi %49, %50 : tensor<64x64xi32> %52 = tt.splat %arg16 : !tt.ptr<f16> -> tensor<64x64x!tt.ptr<f16>> %53 = tt.addptr %52, %51 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32> %54 = arith.truncf %45 : tensor<64x64xf32> to tensor<64x64xf16> tt.store %53, %54, %29 : tensor<64x64x!tt.ptr<f16>> } tt.return } } {-# external_resources: { mlir_reproducer: { pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, 
triton-nvidia-gpu-fence-insertion{compute-capability=90}, sccp, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", disable_threading: false, verify_each: true } } #-} /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline /tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` Triton compilation failed: triton_tem_fused_zeros_1 def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0): PRESCALE_QK : tl.constexpr = False ROWS_GUARANTEED_SAFE : tl.constexpr = False BLOCKS_ARE_CONTIGUOUS : tl.constexpr = False WRITE_DQ : tl.constexpr = True OUTPUT_LOGSUMEXP : tl.constexpr = True FLOAT32_PRECISION : tl.constexpr = 'tf32' IS_DIVISIBLE : tl.constexpr = False SM_SCALE : tl.constexpr = 0.125 GQA_SHARED_HEADS : tl.constexpr = 4 HAS_FULL_BLOCKS : tl.constexpr = True QK_HEAD_DIM : tl.constexpr = 64 QK_HEAD_DIM_ROUNDED : tl.constexpr = 64 V_HEAD_DIM : tl.constexpr = 64 V_HEAD_DIM_ROUNDED : tl.constexpr = 64 SAFE_HEAD_DIM : tl.constexpr = True BLOCK_M1 : tl.constexpr = 64 BLOCK_N1 : tl.constexpr = 64 BLOCK_M2 : tl.constexpr = 64 BLOCK_N2 : tl.constexpr = 64 SPARSE_Q_BLOCK_SIZE : tl.constexpr = 128 SPARSE_KV_BLOCK_SIZE : tl.constexpr = 128 Q = arg_Q K = arg_K V = arg_V LSE = arg_LSE DELTA = arg_DELTA DO = arg_DO DQ = arg_DQ DV = arg_DV KV_NUM_BLKS = arg_KV_NUM_BLKS KV_IDX = arg_KV_IDX Q_NUM_BLKS = arg_Q_NUM_BLKS Q_IDX = arg_Q_IDX FULL_KV_NUM_BLKS = arg_FULL_KV_NUM_BLKS FULL_KV_IDX = arg_FULL_KV_IDX FULL_Q_NUM_BLKS = 
arg_FULL_Q_NUM_BLKS FULL_Q_IDX = arg_FULL_Q_IDX # Sub notation for this kernel: # # Q: Query, K: Key, V: Value # LSE: logsumexp (logsumexp is always stored in fp32 regardless of the input dtype) # DELTA: Precomputed sum(OUT*DO, axis=-1) # DO: Derivative of Output, DQ: Derivative of Query, DV: Derivative of Value # DK: Derivative of Key, is the written to via the store_output call due to some limitations with # inductor codegen # M: Number of queries, N: Number of keys/values # QK_HEAD_DIM: The dimension of the query and key embeddings # V_HEAD_DIM: The dimension of the value embeddings # z: Batch size, h: Number of heads, m: Number of queries or keys/values, d: Head dim # GQA_SHARED_HEADS: number of query heads sharing one kv head in GQA setups. # (Modifiable) Performance tuning options # BLOCK_M1: when calculating DK & DV, iterate over BLOCK_M1 across the seqlen dim of Q in each thread block. # BLOCK_N1: when calculating DK & DV, the thread block size across the seqlen dim of K/V. # BLOCK_M2: when calculating DQ, the thread block size across the seqlen dim of Q. # BLOCK_N2: when calculating DQ, iterate over BLOCK_N2 across the seqlen dim of K/V in each thread block. # # The following FULL_* and PARTIAL_* is defined in the block sparse mask grid, rather than the thread block grid. # KV_NUM_BLKS: The number of KV blocks (that may or may not require masking) for each query. # KV_IDX: The indices of KV blocks (that may or may not require masking) for each query. # Q_NUM_BLKS: The number of Q blocks (that may or may not require masking) for each query. # Q_IDX: The indices of Q blocks (that may or may not require masking) for each query. # FULL_KV_NUM_BLKS: The number of fully unmasked KV blocks (so we don't need masking) for each query. # FULL_KV_IDX: The indices of fully unmasked KV blocks (so we don't need masking) for each query. # FULL_Q_NUM_BLKS: The number of fully unmasked Q blocks (so we don't need masking) for each query. 
# FULL_Q_IDX: The indices of fully unmasked Q blocks (so we don't need masking) for each query. # The below are kernel options that can be applied for certain score_mods, # or involve a numerics vs. perf tradeoff # PRESCALE_QK: Whether to pre-scale QK by 1/sqrt(d) and change of base. Has # about 20% more numerical error, but slightly faster. # Define strides of inputs stride_qz, stride_qh, stride_qm, stride_qd = 32768, 2048, 64, 1 stride_kz, stride_kh, stride_kn, stride_kd = 65536, 16384, 64, 1 stride_vz, stride_vh, stride_vn, stride_vd = 65536, 16384, 64, 1 stride_doz, stride_doh, stride_dom, stride_dod = 32768, 2048, 64, 1 stride_dqz, stride_dqh, stride_dqm, stride_dqd = 32768, 2048, 64, 1 stride_dvz, stride_dvh, stride_dvm, stride_dvd = 65536, 16384, 64, 1 ZQ = 2 HQ = 16 HKV = 4 Q_LEN = 32 ZKV = 2 KV_LEN = 256 MATMUL_PRECISION = Q.dtype.element_ty pid = tl.program_id(0) NUM_KV_BLOCKS = tl.cdiv(KV_LEN, BLOCK_N1) NUM_Q_BLOCKS = tl.cdiv(Q_LEN, BLOCK_M2) off_zq = tl.program_id(1) # q batch idx off_hkv = tl.program_id(2) # kv head idx off_zkv = off_zq % ZKV # kv batch idx SPARSE_Z = 2 SPARSE_HQ = 16 sparse_idx_z = off_zq % SPARSE_Z k_adj = (stride_kh * off_hkv + stride_kz * off_zkv).to(tl.int64) v_adj = (stride_vh * off_hkv + stride_vz * off_zkv).to(tl.int64) # first compute broadcasted dv of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dv of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] dv_adj = (stride_dvh * off_hkv + stride_dvz * off_zq).to(tl.int64) # offset K, V, DV pointers for batch/kv-head K += k_adj V += v_adj DV += dv_adj RCP_LN2 = 1.44269504 offs_k = tl.arange(0, QK_HEAD_DIM_ROUNDED) offs_v = tl.arange(0, V_HEAD_DIM_ROUNDED) if pid >= NUM_KV_BLOCKS: off_pid = pid - NUM_KV_BLOCKS # THIS BLOCK DOES DQ SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M2) SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N2) off_hq2 = off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS start_m2_block = off_pid % NUM_Q_BLOCKS off_pid_mask = start_m2_block // 
SPARSE_Q_MULTIPLE stride_kv_num_blks_h = 1 stride_kv_idx_h = 2 stride_kv_idx_m = 2 sparse_idx_hq2 = off_hq2 % SPARSE_HQ sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq2 sparse_kv_num_blks_offset = sparse_hz_offset * stride_kv_num_blks_h + off_pid_mask sparse_kv_idx_offset = sparse_hz_offset * stride_kv_idx_h + off_pid_mask * stride_kv_idx_m # noqa: B950 # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads. q_adj2 = (stride_qh * off_hq2 + stride_qz * off_zq).to(tl.int64) do_adj2 = (stride_doh * off_hq2 + stride_doz * off_zq).to(tl.int64) dq_adj2 = (stride_dqh * off_hq2 + stride_dqz * off_zq).to(tl.int64) off_chz2 = ((off_zq * HQ + off_hq2) * Q_LEN).to(tl.int64) Q2 = Q + q_adj2 DO2 = DO + do_adj2 # TODO: This does not work if DQ is not the same layout as Q (for example, # if Q is broadcasted) DQ2 = DQ + dq_adj2 LSE2 = LSE + off_chz2 DELTA2 = DELTA + off_chz2 # dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM], dtype=tl.float32) dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM_ROUNDED], dtype=tl.float32) start_m2 = start_m2_block * BLOCK_M2 offs_m2 = start_m2 + tl.arange(0, BLOCK_M2) # load Q and do: they stay in SRAM throughout the inner loop. q = load_checked_2d(Q2, offs_m2, offs_k, stride_qm, stride_qd, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, QK_HEAD_DIM) do = load_checked_2d(DO2, offs_m2, offs_v, stride_dom, stride_dod, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, V_HEAD_DIM) if PRESCALE_QK: q = (q * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION) if IS_DIVISIBLE: Di = tl.load(DELTA2 + offs_m2) lse = tl.load(LSE2 + offs_m2) else: Di = tl.load(DELTA2 + offs_m2, mask=offs_m2 < Q_LEN) lse = tl.load(LSE2 + offs_m2, mask=offs_m2 < Q_LEN) lse = tl.where(lse == -float("inf"), 0.0, lse) lse = lse[:, None] # ~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # KV_IDX and KV_NUM_BLKS are always contiguous. 
kv_indices = KV_IDX + sparse_kv_idx_offset kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading sparse_kv_num_blocks = tl.load(KV_NUM_BLKS + sparse_kv_num_blks_offset) offs_n2 = kv_start + tl.arange(0, BLOCK_N2) dq = bwd_dq_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, K, V, dq, q, do, Di, lse, off_zq, off_hq2, offs_m2, offs_n2, stride_kn, stride_kd, stride_vn, stride_vd, kv_indices, sparse_kv_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=False, ) if HAS_FULL_BLOCKS: # ~~~~~~~~~~~ partial unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # FULL_KV_IDX and FULL_KV_NUM_BLKS are always contiguous. kv_indices = FULL_KV_IDX + sparse_kv_idx_offset kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading sparse_kv_num_blocks = tl.load(FULL_KV_NUM_BLKS + sparse_kv_num_blks_offset) offs_n2 = kv_start + tl.arange(0, BLOCK_N2) dq = bwd_dq_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, K, V, dq, q, do, Di, lse, off_zq, off_hq2, offs_m2, offs_n2, stride_kn, stride_kd, stride_vn, stride_vd, kv_indices, sparse_kv_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=True, ) # Write back dQ. 
dq_ptrs = DQ2 + offs_m2[:, None] * stride_dqm + offs_k[None, :] * stride_dqd dq = SM_SCALE if IS_DIVISIBLE and SAFE_HEAD_DIM: tl.store(dq_ptrs, dq) else: tl.store(dq_ptrs, dq, mask=(offs_m2[:, None] < Q_LEN) & (offs_k[None, :] < QK_HEAD_DIM)) else: # THIS BLOCK DOES DK & DV SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M1) SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N1) pid_mask = pid // SPARSE_KV_MULTIPLE stride_q_num_blks_h = 2 stride_q_idx_h = 2 stride_q_idx_n = 1 dv = tl.zeros([BLOCK_N1, V_HEAD_DIM_ROUNDED], dtype=tl.float32) dk = tl.zeros([BLOCK_N1, QK_HEAD_DIM_ROUNDED], dtype=tl.float32) start_n1 = pid BLOCK_N1 offs_n1 = start_n1 + tl.arange(0, BLOCK_N1) # load K and V: they stay in SRAM throughout the inner loop. k = load_checked_2d(K, offs_n1, offs_k, stride_kn, stride_kd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, QK_HEAD_DIM) v = load_checked_2d(V, offs_n1, offs_v, stride_vn, stride_vd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, V_HEAD_DIM) if PRESCALE_QK: k = (k * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION) for off_g in range(0, GQA_SHARED_HEADS): off_hq1 = off_hkv * GQA_SHARED_HEADS + off_g # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads. 
q_adj1 = (stride_qh * off_hq1 + stride_qz * off_zq).to(tl.int64) do_adj1 = (stride_doh * off_hq1 + stride_doz * off_zq).to(tl.int64) dq_adj1 = (stride_dqh * off_hq1 + stride_dqz * off_zq).to(tl.int64) off_chz1 = ((off_zq * HQ + off_hq1) * Q_LEN).to(tl.int64) Q1 = Q + q_adj1 DO1 = DO + do_adj1 # TODO: This does not work if DQ is not the same layout as Q (for example, # if Q is broadcasted) LSE1 = LSE + off_chz1 DELTA1 = DELTA + off_chz1 sparse_idx_hq1 = off_hq1 % SPARSE_HQ sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq1 sparse_q_num_blks_offset = sparse_hz_offset * stride_q_num_blks_h + pid_mask sparse_q_idx_offset = sparse_hz_offset * stride_q_idx_h + pid_mask * stride_q_idx_n # noqa: B950 # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Q_IDX and Q_NUM_BLKS are always contiguous. q_indices = Q_IDX + sparse_q_idx_offset q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading sparse_q_num_blocks = tl.load(Q_NUM_BLKS + sparse_q_num_blks_offset) offs_m1 = q_start + tl.arange(0, BLOCK_M1) dk, dv = bwd_dkdv_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, Q1, DO1, DELTA1, LSE1, dk, dv, k, v, off_zq, off_hq1, offs_n1, offs_m1, stride_qm, stride_qd, stride_dom, stride_dod, q_indices, sparse_q_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=False, ) if HAS_FULL_BLOCKS: # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # FULL_Q_IDX and FULL_Q_NUM_BLKS are always contiguous. 
q_indices = FULL_Q_IDX + sparse_q_idx_offset q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading sparse_q_num_blocks = tl.load(FULL_Q_NUM_BLKS + sparse_q_num_blks_offset) offs_m1 = q_start + tl.arange(0, BLOCK_M1) dk, dv = bwd_dkdv_inner( arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0, Q1, DO1, DELTA1, LSE1, dk, dv, k, v, off_zq, off_hq1, offs_n1, offs_m1, stride_qm, stride_qd, stride_dom, stride_dod, q_indices, sparse_q_num_blocks, MATMUL_PRECISION, IS_FULL_BLOCKS=True, ) # Write back dV and dK. dv_ptrs = DV + offs_n1[:, None] * stride_dvm + offs_v[None, :] * stride_dvd index_n = offs_n1[:, None] index_k = offs_k[None, :] index_v = offs_v[None, :] if IS_DIVISIBLE and SAFE_HEAD_DIM: tl.store(dv_ptrs, dv) else: tl.store(dv_ptrs, dv, mask=(index_n < KV_LEN) & (index_v < V_HEAD_DIM)) dk = SM_SCALE if SAFE_HEAD_DIM: mask = index_n < KV_LEN else: mask = (index_n < KV_LEN) & (index_k < QK_HEAD_DIM) # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] xindex = index_k + 64index_n + 16384off_hkv + 65536off_zq tl.store(out_ptr0 + (tl.broadcast_to(xindex, dk.shape)), dk, mask) metadata: {'signature': {'arg_Q': 'fp16', 'arg_K': 'fp16', 'arg_V': 'fp16', 'arg_LSE': 'fp32', 'arg_DELTA': 'fp32', 'arg_DO': 'fp16', 'arg_DQ': 'fp16', 'arg_DV': 'fp16', 'arg_KV_NUM_BLKS': 'i32', 'arg_KV_IDX': 'i32', 'arg_Q_NUM_BLKS': 'i32', 'arg_Q_IDX': 'i32', 'arg_FULL_KV_NUM_BLKS': 'i32', 'arg_FULL_KV_IDX': 'i32', 'arg_FULL_Q_NUM_BLKS': 'i32', 'arg_FULL_Q_IDX': 'i32', 'out_ptr0': 'fp16'}, 'device': 0, 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): 
[['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (9,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (14,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 4, 'num_stages': 3, 'debug': True, 'cc': 100} Traceback (most recent call last): File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config binary = triton.compile(compile_args, *compile_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile next_module = compile_ir(module, metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda> stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir pm.run(mod) RuntimeError: PassManager::run failed frames [('total', 3), ('ok', 3)] inline_call [] stats [('calls_captured', 8), ('unique_graphs', 3)] aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('ok', 1)] inductor [('triton_bundler_save_kernel', 8), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1), ('fxgraph_cache_bypass', 1)] graph_break [] F ==================================================== FAILURES ===================================================== _____________________________ TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 ______________________________ Traceback (most recent call last): File 
"/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper method(args, *kwargs) File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper method(args, kwargs) File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 446, in instantiated_test raise rte File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test result = test(self, param_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1349, in dep_fn return fn(self, args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1215, in dep_fn return fn(slf, args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 1430, in test_GQA self.run_test(inputs) File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 566, in run_test compiled_out.backward(backward_grad) File "/home/drisspg/meta/pytorch/torch/_tensor.py", line 625, in backward torch.autograd.backward( File "/home/drisspg/meta/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/drisspg/meta/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/drisspg/meta/pytorch/torch/autograd/function.py", line 315, in apply return user_fn(self, args) ^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2303, in backward return impl_fn() ^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2289, in impl_fn out = CompiledFunction._backward_impl(ctx, all_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2394, in _backward_impl CompiledFunction.compiled_bw = aot_config.bw_compiler( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/schemas.py", line 1256, in __call__ return self.compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_dynamo/backends/common.py", line 76, in _wrapped_bw_compiler disable( File "/home/drisspg/meta/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn return fn(args, *kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_utils_internal.py", line 92, in wrapper_function return function(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 2428, in bw_compiler return inner_compile( ^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 773, in compile_fx_inner return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_dynamo/repro/after_aot.py", line 124, in debug_wrapper inner_compiled_fn = compiler_fn(gm, example_inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 952, in _compile_fx_inner mb_compiled_graph = fx_codegen_and_compile( ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 
1652, in fx_codegen_and_compile return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile compiled_module = graph.compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2318, in compile_to_module return self._compile_to_module() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2328, in _compile_to_module mod = self._compile_to_module_lines(wrapper_code) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2396, in _compile_to_module_lines mod = PyCodeCache.load_by_key_path( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/codecache.py", line 3466, in load_by_key_path mod = _reload_python_module(key, path, set_sys_modules=in_toplevel) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module exec(code, mod.__dict__, mod.__dict__) File "/tmp/tmp0yiz3c94/az/caza2gzmsagyuusmf2ka3oat3na4xv6zudssk244xmlzsbv2knze.py", line 117, in <module> File "/home/drisspg/meta/pytorch/torch/_inductor/async_compile.py", line 489, in triton kernel.precompile( File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 437, in precompile self._precompile_worker() File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 459, in _precompile_worker compile_results.append(self._precompile_config(c)) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config binary = triton.compile(compile_args, **compile_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile next_module = compile_ir(module, metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda> stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir pm.run(mod) RuntimeError: PassManager::run failed To execute this test, run the following from the base repo dir: python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ============================================= short test summary info ============================================= FAILED [5.1441s] test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_GQA_score_mod1_cuda_float16 - RuntimeError: PassManager::run failed ================================== 1 failed, 1 passed, 1404 deselected in 18.10s ================================== ~/meta/pytorch flex-warning !1 ❯ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160227 Approved by: https://github.com/Skylion007, https://github.com/Chillee	2025-08-11 23:30:20 +00:00
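The kernel dumped in the log above splits its first grid axis between two jobs: program ids below `NUM_KV_BLOCKS` compute dK and dV, and the remaining ids compute dQ. A minimal plain-Python sketch of that index decoding follows, using the constants printed in the log (`KV_LEN = 256`, `Q_LEN = 32`, 64-wide blocks, `GQA_SHARED_HEADS = 4`). The helper names `cdiv` and `decode_pid` are hypothetical and exist only for illustration; this is not the Triton code itself.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, mirroring what tl.cdiv computes for positive ints."""
    return -(-a // b)


# Constants copied from the kernel dump in the log.
KV_LEN, Q_LEN = 256, 32
BLOCK_N1, BLOCK_M2 = 64, 64
GQA_SHARED_HEADS = 4

NUM_KV_BLOCKS = cdiv(KV_LEN, BLOCK_N1)
NUM_Q_BLOCKS = cdiv(Q_LEN, BLOCK_M2)


def decode_pid(pid: int, off_hkv: int):
    """Hypothetical helper: map (program id, kv head) to the work item it owns."""
    if pid >= NUM_KV_BLOCKS:
        # dQ branch: the leftover ids enumerate (query head in group, Q block).
        off_pid = pid - NUM_KV_BLOCKS
        off_hq2 = off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS
        start_m2_block = off_pid % NUM_Q_BLOCKS
        return ("dq", off_hq2, start_m2_block)
    # dK/dV branch: pid is the KV block index for this kv head.
    return ("dkdv", off_hkv, pid)


if __name__ == "__main__":
    for pid in range(NUM_KV_BLOCKS + GQA_SHARED_HEADS * NUM_Q_BLOCKS):
        print(pid, decode_pid(pid, off_hkv=1))
```

With these sizes there are four KV blocks and one Q block, so ids 0 to 3 take the dK/dV branch and ids 4 to 7 each handle one of the four query heads that share the kv head.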
Sherlock Huang	99bc2f94c1	Update export/schema.py (#160220) Summary: A Model can have multiple ExportedPrograms: one per method and one per delegate, and each can have different weights. For this reason, weights are now stored per ExportedProgram. Also, we clean up Model and Program. IIUC, Model and Program are not used anywhere, so it's OK to make a BC-breaking change. Test Plan: CI Rollback Plan: Differential Revision: D79917395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160220 Approved by: https://github.com/angelayi, https://github.com/dolpm, https://github.com/jingsh	2025-08-11 23:14:08 +00:00
Yidi Wu	fc25c68f20	[hop][exc] make UncapturedHigherOrderOpError print user code and avoid re-raise (#159296) After the change, the error stack trace includes the user code stack and is collapsed into a single traceback (without the "scroll up" message). For example: ```python class Test(torch.nn.Module): def forward(self, c, x): def cond_fn(c, x): return c > 0 and x.size(0) < 20 def body_fn(c, x): return c - 1, x.sin() return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x)) ``` Now gives the following error message: ```python Traceback (most recent call last): File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1705, in test_while_loop_size_mismatch_tensor_expansion self._run_test( ~~~~~~~~~~~~~~^ model=WhileLoopModels.SizeMismatchTensorExpansion(), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<2 lines>... dynamic=dynamic, ^^^^^^^^^^^^^^^^ ) ^ File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1417, in _run_test result = model(inputs_with_counters) File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(args, *kwargs) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(args, *kwargs) File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1053, in forward return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 176, in while_loop return torch.compile( ~~~~~~~~~~~~~~ _while_loop_op_wrapper, backend=backend, fullgraph=True ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ )(flat_cond_fn, flat_body_fn, tuple(flat_inputs), tuple()) ~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper return fn(args, *kwargs) 
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1595, in __call__ result = self._torchdynamo_orig_backend( frame, cache_entry, self.hooks, frame_state, skip=1 ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1353, in __call__ result = self._inner_convert( frame, cache_entry, hooks, frame_state, skip=skip + 1 ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 682, in __call__ result = _compile( frame.f_code, ...<16 lines>... convert_frame_box=self._box, ) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1172, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/home/yidi/local/pytorch/torch/_utils_internal.py", line 98, in wrapper_function return function(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 858, in compile_inner return _compile_inner(code, one_graph, hooks, transform) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 897, in _compile_inner out_code = transform_code_object(code, transform) File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1461, in transform_code_object transformations(instructions, code_options) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 300, in _fn return fn(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 818, in transform tracer.run() ~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3528, in run super().run() ~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper return 
inner_fn(self, inst) File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX self.call_function(fn, argsvars.items, kwargsvars) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward return getattr(self.realize(), name)(args, *kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 91, in graph_break_as_hard_error raise exc.with_traceback(sys.exc_info()[2]) from None File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 77, in graph_break_as_hard_error return fn(args, *kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1287, in call_function ) = speculate_subgraph( ~~~~~~~~~~~~~~~~~~^ tx, ^^^ ...<33 lines>... 
supports_aliasing=self.supports_aliasing, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 877, in speculate_subgraph raise ex File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 718, in speculate_subgraph output = f.call_function(tx, args, sub_kwargs) File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function return super().call_function(tx, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function return tx.inline_user_function_return(self, [self.self_args(), args], kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call return tracer.inline_call_() ~~~~~~~~~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_ self.run() ~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper return inner_fn(self, inst) File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX self.call_function(fn, argsvars.items, kwargsvars) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in 
call_function self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type] ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward return getattr(self.realize(), name)(args, *kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function return super().call_function(tx, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function return tx.inline_user_function_return(self, [self.self_args(), args], kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call return tracer.inline_call_() ~~~~~~~~~~~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_ self.run() ~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run while self.step(): ~~~~~~~~~^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step self.dispatch_table[inst.opcode](self, inst) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 830, in inner unimplemented_v2( ~~~~~~~~~~~~~~~~^ gb_type="Data-dependent branching", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<5 lines>... 
], ^^ ) ^ File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 580, in unimplemented_v2 raise Unsupported(msg) torch._dynamo.exc.UncapturedHigherOrderOpError: while_loop doesn't work unless it is captured completely with torch.compile. Got Data-dependent branching Explanation: Detected data-dependent branching (e.g. `if my_tensor.sum() > 0:`). Dynamo does not support tracing dynamic control flow. Hint: This graph break is fundamental - it is unlikely that Dynamo will ever be able to trace through your code. Consider finding a workaround. Hint: Use `torch.cond` to express dynamic control flow. Developer debug context: attempted to jump with TensorVariable() For more details about this graph break, please visit: https://pytorch-labs.github.io/compile-graph-break-site/gb/gb0170.html from user code: File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 167, in _while_loop_op_wrapper return while_loop_op(args, *kwargs) File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 137, in flat_cond_fn return cond_fn(carried, *additional) File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1047, in cond_fn return c > 0 and x.size(0) < 20 Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" To execute this test, run the following from the base repo dir: python test/inductor/test_control_flow.py WhileLoopTests.test_while_loop_size_mismatch_tensor_expansion_device_cpu_dynamic_False This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159296 Approved by: https://github.com/zou3519	2025-08-11 22:48:10 +00:00
Pat Vignola	5a40c57844	[MTIA] Implement isAvailable() for MTIA hooks (#160304 ) Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods. Test Plan: `torch._C._get_accelerator()` Rollback Plan: Differential Revision: D79981115 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304 Approved by: https://github.com/nautsimon	2025-08-11 21:45:11 +00:00
Nikita Shulga	7d2ec704e4	Fix MPS autocast for ConvTranspose3d (#160345 ) ## Summary - ensure ConvTranspose3d uses fp32 under MPS autocast - add MPS autocast test for ConvTranspose3d Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c Fixes https://github.com/pytorch/pytorch/issues/160332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345 Approved by: https://github.com/dcci	2025-08-11 21:01:52 +00:00
Sandeep Narendranath Karjala	fc80f6859e	Fix collective schedule logging and runtime tests (#160260) Summary: - Fix collective schedule logging so that it only logs when collectives are present - Fix runtime estimate test to check that each op has a numeric value Pull Request resolved: https://github.com/pytorch/pytorch/pull/160260 Approved by: https://github.com/Skylion007	2025-08-11 20:58:52 +00:00
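The two fixes in this commit can be illustrated with a small stand-alone sketch. It assumes a schedule represented as a list of collective op names and runtime estimates represented as a dict from op name to value; both representations, and the function names `format_collective_schedule` and `all_estimates_numeric`, are hypothetical and are not taken from the PyTorch sources.

```python
import numbers


def format_collective_schedule(schedule):
    """Return a log line for the schedule, or None when there is nothing to log."""
    if not schedule:
        # No collectives in the graph: skip logging entirely.
        return None
    return "collective schedule: " + ", ".join(schedule)


def all_estimates_numeric(estimates):
    """Check that every op's runtime estimate is a real number (bools excluded)."""
    return all(
        isinstance(value, numbers.Real) and not isinstance(value, bool)
        for value in estimates.values()
    )


if __name__ == "__main__":
    print(format_collective_schedule([]))
    print(format_collective_schedule(["all_reduce", "all_gather"]))
    print(all_estimates_numeric({"mm": 1.5, "add": 0}))
```

Returning `None` for an empty schedule lets the caller decide whether to emit anything, which keeps logs quiet for graphs without collectives.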
PaulZhang12	cf0a0dcb0a	Make user defined Triton kernels serializable for fx_graph_runnable (#160002 ) Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002 Approved by: https://github.com/eellison	2025-08-11 20:54:33 +00:00
PyTorch MergeBot	b149c7204c	Revert "port distributed pipeline test files for Intel GPU (#159033 )" This reverts commit 76a0609b6bddb2bc40f1eb4ade12885023653d59. Reverted https://github.com/pytorch/pytorch/pull/159033 on behalf of https://github.com/clee2000 due to broke test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/16890370216/job/47849586456) [HUD commit link](`76a0609b6b`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/159033#issuecomment-3176833314))	2025-08-11 20:44:45 +00:00
PyTorch MergeBot	09381f5dac	Revert "[Graph Partition] Pass all OSS unit tests (#154667 )" This reverts commit ca7315c17162ea21b1ca5ba23f4bf6168766c7b9. Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](`ca7315c171`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))	2025-08-11 20:34:27 +00:00
Pian Pawakapan	9eedd2a20b	[PGO] no counterfactual suggestions for dynamic allowlist (#160231 ) Being more conservative with whitelist suggestions as we roll out suggestions; now we only suggest sources that were dynamic in previous runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160231 Approved by: https://github.com/bobrenjc93	2025-08-11 20:13:25 +00:00
Edward Yang	c3dc8dc412	159965 is merged, no need to patch it in (#160275 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160275 Approved by: https://github.com/albanD, https://github.com/ZainRizvi	2025-08-11 19:55:04 +00:00
Liao, Wei	76a0609b6b	port distributed pipeline test files for Intel GPU (#159033 ) In this PR we port all distributed pipeline test files. We enable Intel GPU with the following methods while keeping the original code style as far as possible: 1. instantiate_device_type_tests() 2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 3. use "requires_accelerator_dist_backend()" to replace requires_nccl() 4. use "get_default_backend_for_device()" to get the backend 5. enable XPU for some test paths 6. add TEST_MULTIACCELERATOR in common_utils for all backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Daisy Deng <daisy.deng@intel.com>	2025-08-11 19:43:15 +00:00
Simon Fan	c8205cb354	[autograd] match 0-dim gradients device type regardless of subclassness (#160165 ) Not sure if there are some subclasses where the outer.dim() == 0 but you wouldn't want to move it? FIXES https://github.com/pytorch/pytorch/issues/160084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160165 Approved by: https://github.com/ezyang, https://github.com/albanD	2025-08-11 17:57:32 +00:00
Nikita Shulga	d25c4f954d	[MPS] Type-promote tensor-iterator common dtype (#160334 ) Otherwise, `torch.add(FloatTensor, IntTensor, alpha=2)` and `torch.add(IntTensor, FloatTensor, alpha=2)` were dispatched to different kernels Fixes https://github.com/pytorch/pytorch/issues/160208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-08-11 17:53:56 +00:00
David Berard	d0e2240f68	[triton_heuristics] Optimize the triton launcher in pt2 (#160000 ) Summary: (Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent) We observed ~10us PT2-Triton launch overhead regression after pin update. Before Triton pin-update: {F1980557238} After Triton pin-update: {F1980557240} The root cause is because https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path. The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel. Note that the static_cuda_launcher.py does not require constants to be passed to the cubin launcher (`e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)`), there is no need to pass in constexprs to the generated launcher code. 
The new launcher code needs to work on three cases: - StaticallyLaunchedCudaKernel - triton.compile.CompiledKernel - AOTInductor Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0 Test Plan: Before: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.893x ``` ``` $ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00760921 1.80298 0.623282 5.25024 0.203722 19 0.00799885 4.78223 1.00226 5.8213 0.239084 average 0.00780403 3.29261 0.812769 5.53577 0.221403 ``` After: ``` buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency x_val nop_python_function-walltime nop_triton_kernel-walltime nop_triton_compiled_kernel_run-walltime nop_inductor_kernel-walltime nop_inductor_kernel_cudagraph-walltime ------- ------------------------------ ---------------------------- ----------------------------------------- ------------------------------ ---------------------------------------- 0 0.00747067 1.92589 0.726509 4.35459 0.204205 19 0.00747823 7.36852 1.26241 6.28208 0.239278 average 0.00747445 4.6472 0.994459 5.31834 0.221741 ``` ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs 1.985x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000 Approved by: https://github.com/jansel Co-authored-by: Xu Zhao <xzhao9@meta.com>	2025-08-11 17:22:40 +00:00
Shangdi Yu	9ccd0f5e31	Fix unbacked symint and memory leak in inductor memory planning (#159839 ) Summary: In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known until they are computed at run time, so allocation pools that involve unbacked symints cannot be allocated until we have their values. So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it is available only when its live range begins. Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its live range. If a block's earliest available time is later than the start of a pool's live range, we cannot allocate it from the pool. We also fix a memory leak that's caused by allocating a tensor without wrapping it in a RAIIAtenTensorHandle. In the python wrapper for JIT inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper; it just returns the string to alloc. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns string `RAIIAtenTensorHandle`. ``` AtenTensorHandle handle_name; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....); ``` This is bug prone. If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well, otherwise you get memory leaks. We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In the python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool. Test Plan: ``` python test/inductor/test_memory_planning.py ``` Rollback Plan: Differential Revision: D79603119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839 Approved by: https://github.com/jansel	2025-08-11 17:16:15 +00:00
Boyuan Feng	ca7315c171	[Graph Partition] Pass all OSS unit tests (#154667 ) Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315). Run the same diff on two days and both show speedup on average. [first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d) <img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" /> [second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf) <img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667 Approved by: https://github.com/eellison	2025-08-11 16:25:12 +00:00
Richard Barnes	68a4b4b2e3	[codemod] Fix unreachable-break issue in caffe2/c10/cuda/CUDAFunctions.cpp +2 (#160257 ) Summary: LLVM has a warning `-Wunreachable-code-break` which identifies `break` statements that cannot be reached. These compromise readability, are misleading, and may identify bugs. This diff removes such statements. For questions/comments, contact r-barnes. - If you approve of this diff, please use the "Accept & Ship" button :-) Test Plan: Sandcastle Rollback Plan: Differential Revision: D79835614 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160257 Approved by: https://github.com/Skylion007	2025-08-11 16:09:24 +00:00
Xu Han	80cca83079	[inductor] Skip some AOTI UTs on Windows. (#160287 ) Skip some AOTI UTs on Windows, where AOTI is not fully ready yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160287 Approved by: https://github.com/ezyang	2025-08-11 13:50:43 +00:00
Xu Han	515cb70367	[inductor] normalize_path_separator for test_different_file_paths_local_pgo (#160286 ) `normalize_path_separator` for test_different_file_paths_local_pgo Pull Request resolved: https://github.com/pytorch/pytorch/pull/160286 Approved by: https://github.com/ezyang	2025-08-11 13:50:18 +00:00
cyy	c184cb3852	[submodule] Bump fbgemm to latest (#158210 ) Merge the recent commits of FBGEMM and remove unnecessary CMake code. Specifically, we 1. enable `fbgemm_autovec` since the target is now correctly handled. 2. remove option `USE_FAKELOWP` which is not used. 3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210 Approved by: https://github.com/q10	2025-08-11 13:48:02 +00:00
PyTorch UpdateBot	2259dbed4e	Update slow tests (#158222 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158222 Approved by: https://github.com/pytorchbot	2025-08-11 12:00:13 +00:00
PyTorch UpdateBot	05029ad1c3	[xla hash update] update the pinned xla hash (#160306 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160306 Approved by: https://github.com/pytorchbot	2025-08-11 11:28:49 +00:00
cyy	cf4964be68	Remove unnecessary CMake checks for glog (#158185 ) With the update to CMake 3.27, some old scripts can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158185 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-08-11 10:14:47 +00:00
Tanmay Sinha	ecea81117b	Fix clang builds by adding headers (#160252 ) Clang compiler from llvm-14 fails to build full torch from source with the message ``` no template named 'unordered_map' in namespace 'std' std::unordered_map<std::string, HandlerFunc> handlers_{}; ~~~~~^ ``` A similar issue here https://github.com/intel/llvm/issues/5264 Fix is to add the correct headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160252 Approved by: https://github.com/Skylion007, https://github.com/cyyever	2025-08-11 09:03:14 +00:00
fduwjj	1c2cba17ea	[FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119 ) To better help users debug with FR, we want to add stack_id and print a map between stack_id and stack_trace (optional) Screenshot: <img width="1029" height="529" alt="image" src="https://github.com/user-attachments/assets/8404a1d3-cc33-4f5f-971b-29609ec316c1" /> <img width="1620" height="358" alt="image" src="https://github.com/user-attachments/assets/3dd29c8c-ff68-41a2-acfd-e770036cfeb1" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119 Approved by: https://github.com/H-Huang, https://github.com/wconstab	2025-08-11 07:27:10 +00:00
Nick Riasanovsky	ff0d56d035	[Inductor] [Triton] Enable Configuration warmup/rep iterations when benchmarking in inductor (#159982 ) Summary: When benchmarking on B200 Max Autotune, I discovered that the estimations from the autotune logs consistently produced a better ATEN result by > 20% on an example shape. Here is an example of the output: ``` Autotune Choices Stats: {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3081120103597641, "best_triton_pos": 1, "best_triton_time": 0.6589759886264801, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"} AUTOTUNE mm(3840x1152, 1152x49136) strides: [1, 3840], [49152, 1] dtypes: torch.bfloat16, torch.bfloat16 mm 0.3081 ms 100.0% triton_mm_16 0.6590 ms 46.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_17 0.6830 ms 45.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_13 0.7015 ms 43.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_9 0.8487 ms 36.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_11 0.8695 ms 35.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_10 0.8797 ms 35.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_18 0.9089 ms 33.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_14 0.9718 ms 31.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_15 1.0169 ms 30.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 2.8574 seconds and 0.1032 seconds precompiling for 20 choices Removed 3483 outliers from 28645 samples 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:20<00:00, 20.00s/it] (M, N, K) pt2_matmul_maxautotune-latency pt2_matmul_maxautotune-speedup pt2_matmul_maxautotune-tflops ------------------- -------------------------------- -------------------------------- ------------------------------- (3840, 49136, 1152) 0.359392 (±8.27%) 1209.61 average 1209.61 ``` Based on my reading about B200 power usage, I believe this is due to the new for power aware benchmarking as a kernel may perform better in short bursts. This adds environment variables to expand autotuning iterations so we can get more consistent results between the estimation and the actual runtime. 
I did not update the default yet, even for B200 because I'm not sure how this is used in practice. This is the new output: ``` Autotune Choices Stats: {"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3848319947719574, "best_triton_pos": 1, "best_triton_time": 0.6287680268287659, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"} AUTOTUNE mm(3840x1152, 1152x49136) strides: [1, 3840], [49152, 1] dtypes: torch.bfloat16, torch.bfloat16 mm 0.3848 ms 100.0% triton_mm_16 0.6288 ms 61.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_13 0.6299 ms 61.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_17 0.6728 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_9 0.7189 ms 53.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_18 0.8566 ms 44.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_11 0.8693 ms 44.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_14 0.9298 ms 41.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_10 0.9524 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_15 1.0216 ms 37.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 3.9245 seconds and 0.0965 seconds precompiling for 20 choices Removed 3537 outliers from 29530 samples 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:23<00:00, 23.70s/it] (M, N, K) pt2_matmul_maxautotune-latency pt2_matmul_maxautotune-speedup pt2_matmul_maxautotune-tflops ------------------- -------------------------------- -------------------------------- ------------------------------- (3840, 49136, 1152) 0.359328 (±9.71%) 1209.82 average 1209.82 ``` Test Plan: `TORCH_AUTOTUNE_REP=1000 CUDA_VISIBLE_DEVICES=2 ENABLE_MMA_V5_ATT_PIPELINE=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op gemm --iter $NUM_ITERS --input-loader /home/njriasan/parsed_shapes.json --only pt2_matmul_maxautotune` Rollback Plan: Reviewed By: NikhilAPatel Differential Revision: D79737929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159982 Approved by: 
https://github.com/NikhilAPatel	2025-08-11 05:27:51 +00:00
Jiaxi WANG	334b38ccc4	Fix typo in README.md (#160160 ) The "Get the PyTorch Source" section is now located before the "Install Dependencies/Common" section, so "... using the “Get the PyTorch Source“ section below" should be "... using the “Get the PyTorch Source“ section above". Pull Request resolved: https://github.com/pytorch/pytorch/pull/160160 Approved by: https://github.com/BoyuanFeng	2025-08-11 05:09:59 +00:00
FFFrog	dc0d18e023	[CUDA] Remove the unnecessary CUDA_GUARD (#160249 ) `CUDA_GUARD` is unnecessary in `initDeviceStreamState`, because `initSingleStream` has already done it. `29712314dd/c10/cuda/CUDAStream.cpp (L202-L203)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160249 Approved by: https://github.com/Skylion007	2025-08-11 05:08:05 +00:00
cyy	8ae4d2652f	Tidy torch/csrc/jit/passes/onnx code (#160262 ) Apply clang-tidy fixes to torch/csrc/jit/passes/onnx Pull Request resolved: https://github.com/pytorch/pytorch/pull/160262 Approved by: https://github.com/justinchuby	2025-08-11 04:50:38 +00:00
Edward Z. Yang	8088cfa592	Add type assert for tensor_meta, based on real bug in autoparallel. (#157927 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157927 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/wconstab	2025-08-11 04:22:02 +00:00
Nikita Shulga	d8cb3db533	Add unsigned support to `IValue` (#160102 ) - Moved repeated logic of saving int64/uint64 into a polymorphic container into `THPUtils_unpackInteger` - Added `TestPythonDispatch.test_dispatch_uint64` regression test Fixes https://github.com/pytorch/pytorch/issues/159168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160102 Approved by: https://github.com/ezyang	2025-08-11 03:57:18 +00:00
Han, Xu	e7152ff8a6	[inductor] fix some windows inductor UTs (#160292 ) This PR is the UT part of https://github.com/pytorch/pytorch/pull/160161. Per @malfet's comment (https://github.com/pytorch/pytorch/pull/160161#pullrequestreview-3103812178), this PR does not land the turn-on change and only lands the UT part. Changes: 1. Fixed `test_invalid_artifact_flag_error_msg`. 2. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 3. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160292 Approved by: https://github.com/malfet	2025-08-11 02:55:37 +00:00
Nikita Shulga	842cc77ab9	[MPS] Extend addmm to integral types (#160270 ) By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they could be int64, int32 or float. Unified all MM flavor instantiations through `INSTANTIATE_MM_OPS` and tested that the `addmm` metal kernel works as expected for floating types as well by testing it via ``` PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_ ``` Fixes https://github.com/pytorch/pytorch/issues/154901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #160228, #160234	2025-08-11 00:54:17 +00:00
PyTorch MergeBot	b602ea9cab	Revert "[inductor] turn on windows inductor UTs (#160161 )" This reverts commit 4416433c7c625127b7f975c92f8ec98ea4c67fd3. Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/xuhancn due to auto merged with two related issue ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172982125))	2025-08-11 00:04:25 +00:00
Xu Han	4416433c7c	[inductor] turn on windows inductor UTs (#160161 ) With this PR, we can turn on the inductor UTs on Windows CPU. Changes: 1. Turn on inductor UTs on Windows CPU. 2. Add a shard to balance the added UTs; otherwise the run would time out. 3. Fixed `test_invalid_artifact_flag_error_msg`. 4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 5. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161 Approved by: https://github.com/jansel	2025-08-10 23:18:35 +00:00
Andy (An) Wang	05c19d1ace	[Inductor] Add back the revert part (#160054 ) Add back the reverted code (https://github.com/pytorch/pytorch/pull/159809) as we've figured out the actual root cause of the internal test failures. More details in the internal diff. Rollback Plan: Differential Revision: D79776691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160054 Approved by: https://github.com/blaine-rister	2025-08-10 19:20:30 +00:00
Xu Han	d6786741a7	[inductor] slow test some Windows UTs. (#160267 ) After the Windows inductor UTs were enabled by https://github.com/pytorch/pytorch/pull/160161/, the main branch CI started timing out. Let's move some UTs to slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160267 Approved by: https://github.com/ezyang	2025-08-10 18:35:42 +00:00
PyTorch MergeBot	7ae0629d64	Revert "[inductor] turn on windows inductor UTs (#160161 )" This reverts commit f0980fc0bbd656d6c02d23ad97e945353b314f35. Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/clee2000 due to broke some inductor tests on windows inductor\test_codecache.py::TestStandaloneCompile::test_different_process [GH job link](https://github.com/pytorch/pytorch/actions/runs/16853706010/job/47748778757) [HUD commit link](`f0980fc0bb`). note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172784292))	2025-08-10 17:33:19 +00:00
Xu Han	0e3e377bd5	[inductor] fix CompiledArtifact.load path on Windows. (#160268 ) fix CompiledArtifact.load path on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160268 Approved by: https://github.com/ezyang	2025-08-10 14:22:52 +00:00
Isalia20	a84b60c0c4	[MPS] Sparse coalesce more dtypes to match cpu (#160254 ) More dtypes to match the cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254 Approved by: https://github.com/malfet	2025-08-10 12:25:18 +00:00
atalman	3ac86e728d	Add Alban and Piotr to list of maintainers (#160187 ) Add Alban and Piotr to list of maintainers Pull Request resolved: https://github.com/pytorch/pytorch/pull/160187 Approved by: https://github.com/albanD	2025-08-10 12:00:16 +00:00
Edward Yang	c9671dc865	Delete Python reference implementation from torchdim, as it is untested (#160115 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160115 Approved by: https://github.com/albanD	2025-08-10 11:21:33 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes #159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
Edward Yang	5dddcd5b07	Correctly copy self.module_stack in ModuleStackTracer (#159956 ) There is a bigger cluster of issues which this does not completely fix, but I think this is a matter of good hygiene, especially because we immediately mutate the dict after assigning it. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159956 Approved by: https://github.com/pianpwk	2025-08-10 03:33:59 +00:00
PyTorch MergeBot	d3d359dbaf	Revert "Fix get_free_symbol_uses for several nodes. (#160134 )" This reverts commit db78943a1ca13a32a3d6045eb15e2b719ee13a2f. Reverted https://github.com/pytorch/pytorch/pull/160134 on behalf of https://github.com/malfet due to No, those are not pre-existing, see `df55ec7d4b/1` ([comment](https://github.com/pytorch/pytorch/pull/160134#issuecomment-3172314322))	2025-08-10 02:37:40 +00:00
Nikita Shulga	df55ec7d4b	[OpInfo][BE] Better inputs for addmm (#160234 ) Right now alpha and beta are both less than one, which makes them useless for all addmm samples for integral types Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234 Approved by: https://github.com/Skylion007 ghstack dependencies: #160228	2025-08-10 01:26:48 +00:00
Xu Han	f0980fc0bb	[inductor] turn on windows inductor UTs (#160161 ) With this PR, we can turn on the inductor UTs on Windows CPU. Changes: 1. Turn on inductor UTs on Windows CPU. 2. Add a shard to balance the added UTs; otherwise the run would time out. 3. Fixed `test_invalid_artifact_flag_error_msg`. 4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`. 5. Skipped the whole UT `test_cpu_select_algorithm.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161 Approved by: https://github.com/jansel	2025-08-09 21:06:00 +00:00
Laith Sakka	db78943a1c	Fix get_free_symbol_uses for several nodes. (#160134 ) get_free_symbol_uses is used to know which unbacked symbols are used by a given node. Not having get_free_symbol_uses defined properly leads to: 1. Elimination of some nodes because no users are detected (see the added unit test). 2. Incorrect topological sort. Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernel. ComputedBuffer with NonOwningLayout is an interesting case: when the layout is NonOwningLayout, we need to access the base layout of the actual view op and detect the symbols used in it, because we use those symbols when we codegen the ComputedBuffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160134 Approved by: https://github.com/bobrenjc93	2025-08-09 18:15:46 +00:00
thenumberouscode	29712314dd	[fx][pass] Support converting a float32 tensor to a scalar in FX trace. (#158216 ) Fixes https://github.com/pytorch/pytorch/issues/158083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158216 Approved by: https://github.com/laithsakka	2025-08-09 15:13:13 +00:00
cyy	01f66d08d9	Remove outdated CMAKE_CUDA_COMPILER_VERSION branch (#160075 ) Remove the condition `if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0)` in cmake/Codegen.cmake, because we now default to CUDA >= 12.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160075 Approved by: https://github.com/Skylion007	2025-08-09 14:23:17 +00:00
PyTorch MergeBot	2f4c222617	Revert "Make user defined Triton kernels serializable for fx_graph_runnable (#160002 )" This reverts commit 4183d4ff3dcc1d87400326a9a7998c3f9e966f60. Reverted https://github.com/pytorch/pytorch/pull/160002 on behalf of https://github.com/albanD due to Breaks inductor tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/160002#issuecomment-3170855866))	2025-08-09 14:01:58 +00:00
xinan.lin	8047421fbb	[Linter] Expanding the scope of detecting device-bias code. (#159949 ) Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios: 1. Detect device-bias code in functions decorated with @requires_triton. 2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example: ``` if __name__ == "__main__": if HAS_GPU: run_tests() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-09 09:41:16 +00:00
PaulZhang12	4183d4ff3d	Make user defined Triton kernels serializable for fx_graph_runnable (#160002 ) Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002 Approved by: https://github.com/eellison	2025-08-09 09:26:05 +00:00
Sherlock Huang	fb887c3bb5	Add Sherlock and Zhengxu as codeowner for schema.py (#160233 ) Test Plan: CI Rollback Plan: Differential Revision: D79933462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160233 Approved by: https://github.com/zhxchen17	2025-08-09 04:44:12 +00:00
PyTorch UpdateBot	bcf23ecc47	[vllm hash update] update the pinned vllm hash (#160235 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160235 Approved by: https://github.com/pytorchbot	2025-08-09 04:17:32 +00:00
Animesh Jain	303c614f3d	[dynamo] Be consistent with UserMethodVariable source (#160155 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160155 Approved by: https://github.com/StrongerXi	2025-08-09 04:16:14 +00:00
PyTorch UpdateBot	0d88593dd8	[audio hash update] update the pinned audio hash (#160153 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160153 Approved by: https://github.com/pytorchbot	2025-08-09 04:01:31 +00:00
Rob Timpe	5ed4f91779	[dynamo] support itertools.permutations (#159694 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159694 Approved by: https://github.com/guilhermeleobas ghstack dependencies: #159693	2025-08-09 03:01:58 +00:00
Rob Timpe	e07c52b2c0	[dynamo] Improve support for itertools.product (#159693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159693 Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos	2025-08-09 03:01:58 +00:00
cyy	10e3514c96	Remove tensorexpr tests (#158928 ) The tests are not maintained. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928 Approved by: https://github.com/albanD, https://github.com/malfet	2025-08-09 02:21:22 +00:00
Shangdi Yu	11a3565f18	[Torch Native] Add test for packaging weight (#158750 ) Add a test that requires weights to be packaged for torch native. For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model. Once weight deduping is added, we should be able to set this config to False. ``` python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750 Approved by: https://github.com/desertfire	2025-08-09 01:04:21 +00:00
Ankita George	e96c7c4bb0	[dcp][hf] Improve HF consolidation algorithm (#158648 ) Before we had a bunch of if-else cases based on sharding strategy to decide how to save the tensor with different logic for different strategies. This can be consolidated into one function that uses an algorithm to handle all cases by finding the max possible contiguous bytes that can be written Differential Revision: [D78489438](https://our.internmc.facebook.com/intern/diff/D78489438/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158648 Approved by: https://github.com/saumishr	2025-08-09 00:11:22 +00:00
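The "max possible contiguous bytes" idea in the entry above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `max_contiguous_bytes` and its parameters are hypothetical, and it assumes a row-major (C-contiguous) destination tensor and a rectangular shard.

```python
def max_contiguous_bytes(full_shape, shard_shape, itemsize):
    """Largest run of bytes from a rectangular shard that is adjacent
    in the row-major layout of the full tensor (illustrative only)."""
    if len(full_shape) != len(shard_shape):
        raise ValueError("full_shape and shard_shape must have the same rank")
    run = 1
    # Walk from the innermost dimension outward. A dimension extends the
    # contiguous run; the run stops growing at the first dimension where
    # the shard does not span the full extent.
    for full, part in zip(reversed(full_shape), reversed(shard_shape)):
        run *= part
        if part != full:
            break
    return run * itemsize


# Row-wise shard of a 4x6 float32 tensor: two whole rows are adjacent.
print(max_contiguous_bytes((4, 6), (2, 6), 4))
# Column-wise shard: only half of one row is adjacent at a time.
print(max_contiguous_bytes((4, 6), (4, 3), 4))
```

A writer built on this could copy one such run per write call, whatever the sharding strategy, instead of branching on the strategy.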
Jane Xu	9b803cdbe2	[BE] Remove more optim entries from docs coverage ignore list (#160194 ) This PR privatizes ReduceLRSchedulerOnPlateau.is_better -> ReduceLRSchedulerOnPlateau._is_better because that API was never meant to be public. A GitHub search for it also suggests that the API is not commonly used. https://github.com/search?q=.is_better%28&type=code&p=2 If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-08-09 00:09:45 +00:00
Nikita Shulga	8c41cb800a	[MPS][BE] Combine all pre-MacOS14 xfail lists (#160228 ) It does not matter whether a test started to fail after 13.1 or 13.3; the fact is that it still fails on the latest macOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/160228 Approved by: https://github.com/dcci	2025-08-09 00:00:46 +00:00
Yanan Cao (PyTorch)	731ee31f7b	[TorchScript, PT2] Add torch._check compatibility support (#159988 ) Summary: Add support for torch._check() in TorchScript jit.script frontend. * It will be special cased to behave like torch._assert, turned into an if + raise exception. Test Plan: Unit tests Rollback Plan: Differential Revision: D79744604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159988 Approved by: https://github.com/davidberard98	2025-08-08 23:14:13 +00:00
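The "if + raise" lowering described in the entry above can be pictured with a plain-Python stand-in. This is a sketch of the behavior only, not the TorchScript frontend code: the helper name `check_like` is hypothetical, and the exception type and default message are assumptions.

```python
def check_like(cond, message=None):
    """Illustrative stand-in: a check call rewritten as `if not cond: raise`."""
    if not cond:
        # The message may be a zero-argument callable so that building the
        # string is only paid for on the failure path.
        text = message() if callable(message) else message
        raise RuntimeError(text or "Expected cond to be True")


def scale(values, factor):
    check_like(factor > 0, lambda: f"factor must be positive, got {factor}")
    return [v * factor for v in values]


print(scale([1, 2, 3], 2))
```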
Ti-Tai Wang	566c6d52ef	[ONNX] Fix the export of the model having none as output (#160200 ) Fixes #160150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160200 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2025-08-08 23:09:34 +00:00
Aidyn-A	4e2ddb5db6	[Inductor][CUTLASS] Copy cutlass_mock_imports directory (#159724 ) Pip wheels of PyTorch nightly and 2.8 release candidates do not contain `cutlass_mock_imports`. This is the path to the source code: ``` root@8120d02fd9c5:$ tree ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ├── cutlass_mock_imports │ ├── cuda │ │ ├── __init__.py │ │ ├── cuda.py │ │ └── cudart.py │ ├── pydot │ │ └── __init__.py │ └── scipy │ ├── __init__.py │ └── special.py ├── evt_extensions.py └── gemm_operation_extensions.py 5 directories, 8 files ``` And this what installed wheel has: ``` root@8120d02fd9c5:$ tree /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/ /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/ ├── __init__.py ├── evt_extensions.py └── gemm_operation_extensions.py 1 directory, 3 files ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159724 Approved by: https://github.com/henrylhtsang	2025-08-08 22:56:05 +00:00
Kanya-Mo	9e07673deb	Fix test_fsdp_ep.py due to _MeshEnv API change (#158695 ) #132339 changed parent/child mesh related APIs from _MeshEnv. UT TestFSDPWithEP.test_e2e still uses old APIs and will fail: ``` File "/home/kanya/pytorch/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 77, in test_e2e mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, ("dp",)) AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh' To execute this test, run the following from the base repo dir: python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0. Did you mean: 'create_sub_mesh'? ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158695 Approved by: https://github.com/Skylion007, https://github.com/nWEIdia	2025-08-08 22:36:47 +00:00
Eddie Yan	1128f4c2a8	[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for `sm90`, `sm100` (#149282 ) cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282 Approved by: https://github.com/drisspg Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-08-08 22:22:48 +00:00
Robert Hardwick	334ecbd4ff	Add torchao to install_inductor_benchmark_deps cleanup stage (#160191 ) It looks like `torchao` was missed from the cleanup during torchbench setup. Fixes #160188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160191 Approved by: https://github.com/huydhn	2025-08-08 22:18:41 +00:00
PyTorch MergeBot	206c1eef65	Revert "[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655 )" This reverts commit 2ee22e435131369a7e4f8cc4732579acc29a941b. Reverted https://github.com/pytorch/pytorch/pull/159655 on behalf of https://github.com/clee2000 due to broke dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed [GH job link](https://github.com/pytorch/pytorch/actions/runs/16839294394/job/47711078667) [HUD commit link](`2ee22e4351`). Probably a landrace since it did run on the PR ([comment](https://github.com/pytorch/pytorch/pull/159655#issuecomment-3169400889))	2025-08-08 22:04:22 +00:00
Nikita Shulga	28ccc9e724	[MPS] Extend `index_put` to complex types (#160159 ) And delete confusing supported types check. Move all pseudo atomic (but eventually consistent) ops to `c10/metal/atomic.h` header Fixes https://github.com/pytorch/pytorch/issues/160034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159 Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007	2025-08-08 21:54:30 +00:00
Syed Tousif Ahmed	2247aa6d1d	Documents tuning NVLink performance on H100/H200 (#159792 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792 Approved by: https://github.com/ngimel	2025-08-08 20:28:24 +00:00
Sheng Fu	1febab2a89	Do not treat ReinterpretView as a realized node (#159920 ) Summary: Do not treat ReinterpretView as a realized node. Function [gather_origins](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L888) calls is_realized_node to decide if an FX node should be included in the origins of an IR node. ReinterpretView is considered a realized node, so it is not included in the origins. This leads to an incomplete graph. For example: ``` @torchdynamo.optimize("inductor") def fn(input_data, weight): normalized_input = input_data * weight.unsqueeze(0) return normalized_input input_data = torch.randn(4272, 192, requires_grad=True).to(device) weight = torch.randn(192, requires_grad=True).to(device) fn(input_data, weight) ``` The original FX graph returned in [get_kernel_metadata](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L723) is the following: %primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2] %primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1] %mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {}) return %mul The unsqueeze op is missing. 
With this DIFF, the new FX graph is the following: %primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2] %primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1] %unsqueeze : Tensor "f32[1, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.unsqueeze.default](args = (%primals_1, 0), kwargs = {}) %mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {}) return %mul Pull Request resolved: https://github.com/pytorch/pytorch/pull/159920 Approved by: https://github.com/mlazos	2025-08-08 20:13:35 +00:00
Jovian Anthony Jaison	2ee22e4351	[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655 ) This change logs the stack trace of the code being compiled by Dynamo, improving visibility into what is compiled. It adds a stack_trace field to compilation metrics. This helps with debugging and analysis of Dynamo compilation behavior. Ref [D79287964](https://www.internalfb.com/diff/D79287964) Test Plan: $ python -m test_utils Internal: ref [D79372519](https://www.internalfb.com/diff/D79372519) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159655 Approved by: https://github.com/c00w	2025-08-08 19:53:47 +00:00
James Dong	c86040a8e6	[torch.export] Fix test_export_api_with_dynamic_shapes (#160164 ) Summary: Update test KJT's dynamic_shapes to match the newly exported fields. Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test:test_export -- --exact 'caffe2/test:test_export - test_export_api_with_dynamic_shapes_cpp_runtime_nonstrict (caffe2.test.export.test_nativert.NativeRTTestExport)' File changed: fbcode//caffe2/test/export/test_export.py Buck UI: https://www.internalfb.com/buck2/8247eaf8-eaf9-4876-95cb-7b4263d15ef2 Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275093345198 Network: Up: 100KiB Down: 0B (reSessionID-72a2579f-df3f-4262-9aa3-de0db9687 Executing actions. Remaining 0/2 Command: test. Time elapsed: 2:20.5s Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Reviewed By: malaybag Differential Revision: D79862872 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160164 Approved by: https://github.com/angelayi, https://github.com/ezyang	2025-08-08 19:45:30 +00:00
Anshul Sinha	72009ec6be	[replicate][be] improved readability and cleaned up remaining DDP code (#160133 ) Summary: As much of the ReplicateState functionality is copied from FSDPState, I fixed any remaining comments that incorrectly used FSDP instead of Replicate. In addition, instead of labeling modules FSDPModule or FSDPLinear, I have changed it so that it now uses Replicate____. Finally, I have removed some leftover code from the DDP implementation. I have included test cases to verify correctness. Test Case 1. pytest test/distributed/_composable/test_replicate_with_fsdp.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/160133 Approved by: https://github.com/mori360 ghstack dependencies: #160128	2025-08-08 19:42:23 +00:00
Andres Lugo	5f5f508aa8	[ROCm] Ck backend UX refactor (#152951 ) Refactors how the enablement/disablement of CK Gemms and SDPA works. - Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms. - USE_ROCM_CK_GEMM is set to True by default on Linux - Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA. - USE_ROCM_CK_SDPA is set to False by default - (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release) - Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it. - the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951 Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-08-08 18:40:17 +00:00
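The validity check in the backend getters described above amounts to a guarded fallback. The sketch below only illustrates that logic under stated assumptions: `resolve_backend`, its parameters, and the backend strings are hypothetical names, not the identifiers used in the PyTorch sources.

```python
def resolve_backend(requested, built_with_ck, arch_supports_ck, default="default"):
    """Return the backend to use, falling back to the non-CK default when
    CK was requested but is not usable (illustrative only)."""
    if requested == "ck" and not (built_with_ck and arch_supports_ck):
        # Either the build lacks CK support or the running architecture
        # cannot use it, so the request (e.g. from an environment variable)
        # is ignored.
        return default
    return requested


# CK requested on a build without CK support: fall back.
print(resolve_backend("ck", built_with_ck=False, arch_supports_ck=True))
# CK requested, built in, and supported by the architecture: honored.
print(resolve_backend("ck", built_with_ck=True, arch_supports_ck=True))
```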
Yu, Guangye	da1f608ca3	Add UT for torch.accelerator memory-related API (#155200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200 Approved by: https://github.com/albanD ghstack dependencies: #138222, #152932	2025-08-08 17:41:22 +00:00
Yu, Guangye	84f7e88aef	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-08 17:41:22 +00:00
Yu, Guangye	d7114f05b1	Add DeviceAllocator as the base device allocator (#138222 ) # Motivation In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these user cases. <div align="center"> <table> <tr> <td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td> </tr> <tr> <td> ```python torch.xxx.empty_cache ``` </td> <td> ```python torch.accelerator.empty_cache ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_peak_memory_stats ``` </td> <td> ```python torch.accelerator.reset_peak_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_accumulated_memory_stats ``` </td> <td> ```python torch.accelerator.reset_accumulated_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_stats ``` </td> <td> ```python torch.accelerator.memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_allocated ``` </td> <td> ```python torch.accelerator.memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_allocated ``` </td> <td> ```python torch.accelerator.max_memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_reserved ``` </td> <td> ```python torch.accelerator.memory_reserved ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_reserved ``` </td> <td> ```python torch.accelerator.max_memory_reserved ``` </td> </tr> </table> </div> # Solution This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. 
This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222 Approved by: https://github.com/albanD, https://github.com/Camyll	2025-08-08 17:41:10 +00:00
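The unified call path in the entry above (a device-agnostic function that looks up the allocator for the current device and forwards to it) can be pictured with a small pure-Python model. The real `DeviceAllocator` is a C++ class; every name below (`ToyAllocator`, `register_allocator`, `get_device_allocator`, and the module-level `empty_cache`) is a hypothetical stand-in used only to show the dispatch pattern.

```python
from abc import ABC, abstractmethod


class DeviceAllocator(ABC):
    """Hypothetical Python stand-in for the base allocator interface."""

    @abstractmethod
    def empty_cache(self) -> None: ...

    @abstractmethod
    def memory_allocated(self) -> int: ...

    @abstractmethod
    def memory_reserved(self) -> int: ...


class ToyAllocator(DeviceAllocator):
    """Toy caching allocator: freed bytes stay cached until empty_cache()."""

    def __init__(self):
        self.live_bytes = 0
        self.cached_bytes = 0

    def allocate(self, nbytes):
        self.live_bytes += nbytes

    def free(self, nbytes):
        self.live_bytes -= nbytes
        self.cached_bytes += nbytes

    def empty_cache(self):
        self.cached_bytes = 0

    def memory_allocated(self):
        return self.live_bytes

    def memory_reserved(self):
        return self.live_bytes + self.cached_bytes


_ALLOCATORS = {}  # device type -> allocator instance


def register_allocator(device_type, allocator):
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type):
    try:
        return _ALLOCATORS[device_type]
    except KeyError:
        raise RuntimeError(f"no allocator registered for {device_type!r}") from None


def empty_cache(device_type):
    # Device-agnostic entry point: one code path for every backend.
    get_device_allocator(device_type).empty_cache()


toy = ToyAllocator()
register_allocator("toy", toy)
toy.allocate(100)
toy.free(40)
empty_cache("toy")
print(toy.memory_allocated(), toy.memory_reserved())
```

With this shape, caller code no longer needs an if-else chain per backend; it only names the device type, or lets the framework supply the current one.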
albanD	c5ec5458a5	Don't build nccl when distributed is disabled (#160086 ) Because distributed doesn't build on recent compilers, I have to disable distributed, but this makes it still fail as nccl is still built Pull Request resolved: https://github.com/pytorch/pytorch/pull/160086 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-08-08 17:19:16 +00:00
Kurt Mohler	86eb65f7f0	[MPS] Move max_pool2d to Metal for `stride != 1` (#157876 ) This PR updates `max_pool2d` to use a Metal kernel instead of the old MPS graph impl. However, when the `stride` argument is 1 in all dimensions, the old implementation gives significantly better performance, so we fall back to it in that case. Below is a performance comparison of `max_pool2d` before and after this PR, obtained from this script: `2f02f2bf7a/max_pool_mps/perf.py` <details><summary>Click to expand</summary> case \| before PR \| after PR \| speedup \| \| case info -- \| -- \| -- \| -- \| -- \| -- 0 \| 0.014264 \| 0.004473 \| 3.188911245 \| \| (3, 2, 2), {'kernel_size': 2, 'return_indices': True} 1 \| 0.010752 \| 0.00421 \| 2.55391924 \| \| (3, 2, 2), {'kernel_size': 2, 'return_indices': False} 2 \| 0.020777 \| 0.006123 \| 3.393271272 \| \| (3, 10, 10), {'kernel_size': 5, 'return_indices': True} 3 \| 0.011065 \| 0.005759 \| 1.921340511 \| \| (3, 10, 10), {'kernel_size': 5, 'return_indices': False} 4 \| 0.01452 \| 0.007829 \| 1.854642994 \| \| (3, 100, 100), {'kernel_size': 5, 'return_indices': True} 5 \| 0.009258 \| 0.007075 \| 1.308551237 \| \| (3, 100, 100), {'kernel_size': 5, 'return_indices': False} 6 \| 0.188137 \| 0.168688 \| 1.115295694 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True} 7 \| 0.161362 \| 0.154746 \| 1.042753932 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False} 8 \| 0.182883 \| 0.16945 \| 1.079274122 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True} 9 \| 0.156875 \| 0.163346 \| 0.9603847049 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False} 10 \| 0.193433 \| 0.167396 \| 1.155541351 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True} 11 \| 0.158967 \| 
0.151246 \| 1.051049284 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False} 12 \| 0.931071 \| 0.932883 \| 0.9980576342 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True} 13 \| 0.324496 \| 0.3252 \| 0.9978351784 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False} 14 \| 0.944071 \| 0.936246 \| 1.008357846 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True} 15 \| 0.322171 \| 0.314854 \| 1.023239343 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False} 16 \| 0.894158 \| 0.886408 \| 1.008743152 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True} 17 \| 0.309338 \| 0.304146 \| 1.017070749 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False} 18 \| 0.606 \| 0.260546 \| 2.325884873 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True} 19 \| 0.30445 \| 0.231054 \| 1.317657344 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False} 20 \| 0.474708 \| 0.261925 \| 1.812381407 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True} 21 \| 0.23175 \| 0.231883 \| 0.9994264349 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False} 22 \| 0.434475 \| 0.266246 \| 1.631855502 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True} 23 \| 0.236942 \| 0.231792 \| 1.022218196 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False} 24 \| 0.202396 \| 0.174888 \| 1.157289237 \| \| (3, 1000, 
1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True} 25 \| 0.160679 \| 0.158246 \| 1.015374796 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False} 26 \| 0.200354 \| 0.184133 \| 1.088093932 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True} 27 \| 0.160779 \| 0.160679 \| 1.000622359 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False} 28 \| 0.199175 \| 0.178625 \| 1.115045486 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True} 29 \| 0.159458 \| 0.160883 \| 0.9911426316 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False} 30 \| 0.199021 \| 0.165329 \| 1.203787599 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True} 31 \| 0.156337 \| 0.158213 \| 0.9881425673 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False} 32 \| 0.180146 \| 0.174483 \| 1.032455884 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True} 33 \| 0.156988 \| 0.158167 \| 0.9925458534 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False} 34 \| 0.182133 \| 0.176521 \| 1.031792251 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True} 35 \| 0.169042 \| 0.156483 \| 1.080257919 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False} 36 \| 1.767821 \| 1.766254 \| 1.000887188 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True} 37 \| 1.059346 \| 1.058775 \| 1.000539302 \| \| (3, 1000, 1000), {'kernel_size': 
5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False} 38 \| 1.85755 \| 1.859429 \| 0.9989894747 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True} 39 \| 1.100417 \| 1.097683 \| 1.002490701 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False} 40 \| 1.843167 \| 1.847558 \| 0.9976233493 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True} 41 \| 1.090142 \| 1.093163 \| 0.9972364597 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False} 42 \| 0.480867 \| 0.251733 \| 1.910226311 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True} 43 \| 0.319246 \| 0.236479 \| 1.349997251 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False} 44 \| 0.49315 \| 0.256408 \| 1.923301925 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True} 45 \| 0.316746 \| 0.227854 \| 1.390127011 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False} 46 \| 0.4912 \| 0.257762 \| 1.905633879 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True} 47 \| 0.324771 \| 0.229371 \| 1.41592006 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False} 48 \| 0.152904 \| 0.095079 \| 1.608178462 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True} 49 \| 0.102963 \| 0.089217 \| 1.154073775 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False} 50 \| 0.155158 \| 0.095429 \| 1.625899884 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 
'return_indices': True} 51 \| 0.104338 \| 0.089979 \| 1.15958168 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False} 52 \| 0.153121 \| 0.096429 \| 1.587914424 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True} 53 \| 0.103642 \| 0.090254 \| 1.148336916 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False} 54 \| 0.191071 \| 0.165125 \| 1.157129447 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True} 55 \| 0.153971 \| 0.149021 \| 1.033216795 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False} 56 \| 0.193192 \| 0.166892 \| 1.157586942 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True} 57 \| 0.156617 \| 0.15215 \| 1.029359185 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False} 58 \| 0.178033 \| 0.167308 \| 1.06410333 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True} 59 \| 0.157425 \| 0.164404 \| 0.9575496947 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False} 60 \| 1.757638 \| 1.750896 \| 1.0038506 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True} 61 \| 1.048471 \| 1.047967 \| 1.000480931 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False} 62 \| 1.790708 \| 1.789767 \| 1.000525767 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True} 63 \| 1.054575 \| 1.054796 \| 0.9997904808 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False} 64 
\| 1.785837 \| 1.784192 \| 1.000921986 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True} 65 \| 1.054713 \| 1.054492 \| 1.00020958 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False} 66 \| 0.478267 \| 0.261017 \| 1.832321266 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True} 67 \| 0.32005 \| 0.226654 \| 1.412064204 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False} 68 \| 0.484008 \| 0.254721 \| 1.900149575 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True} 69 \| 0.321 \| 0.218842 \| 1.466811672 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False} 70 \| 0.482087 \| 0.248771 \| 1.937874591 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True} 71 \| 0.316558 \| 0.230533 \| 1.373156988 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False} 72 \| 0.137842 \| 0.085088 \| 1.619993419 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True} 73 \| 0.100671 \| 0.0769 \| 1.309115735 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False} 74 \| 0.148321 \| 0.086967 \| 1.705485989 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True} 75 \| 0.101392 \| 0.075454 \| 1.343759112 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False} 76 \| 0.150208 \| 0.083742 \| 1.793699697 \| \| (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True} 77 \| 0.099587 \| 0.075825 \| 1.313379492 \| \| (3, 
1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False} 78 \| 0.622546 \| 0.602729 \| 1.03287879 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True} 79 \| 0.531696 \| 0.5067 \| 1.049330965 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False} 80 \| 0.626646 \| 0.617038 \| 1.015571164 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True} 81 \| 0.530354 \| 0.525367 \| 1.009492412 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False} 82 \| 0.633933 \| 0.577775 \| 1.097197006 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True} 83 \| 0.533067 \| 0.526954 \| 1.011600633 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False} 84 \| 3.372867 \| 3.386412 \| 0.9960001914 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True} 85 \| 1.155975 \| 1.156604 \| 0.9994561665 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False} 86 \| 3.401921 \| 3.39755 \| 1.001286515 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True} 87 \| 1.202829 \| 1.192538 \| 1.008629494 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False} 88 \| 3.23675 \| 3.220238 \| 1.005127571 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True} 89 \| 1.077067 \| 1.085613 \| 0.9921279498 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False} 90 \| 1.572925 \| 0.925625 \| 1.699311276 \| \| (3, 2000, 2000), 
{'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True} 91 \| 0.791204 \| 0.793454 \| 0.9971642969 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False} 92 \| 1.572742 \| 0.922729 \| 1.704446268 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True} 93 \| 0.784292 \| 0.788871 \| 0.9941955022 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False} 94 \| 1.526546 \| 0.925708 \| 1.649057802 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True} 95 \| 0.769321 \| 0.787675 \| 0.9766985114 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False} 96 \| 0.736033 \| 0.612808 \| 1.201082558 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True} 97 \| 0.574625 \| 0.530925 \| 1.082309177 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False} 98 \| 0.722021 \| 0.614488 \| 1.174996094 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True} 99 \| 0.563171 \| 0.533721 \| 1.055178642 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False} 100 \| 0.735725 \| 0.613992 \| 1.198264798 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True} 101 \| 0.583487 \| 0.532513 \| 1.095723485 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False} 102 \| 0.656383 \| 0.575313 \| 1.140914598 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True} 103 \| 0.559796 \| 0.509079 \| 1.099625009 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 
'stride': None, 'padding': 0, 'return_indices': False} 104 \| 0.662046 \| 0.572362 \| 1.156691045 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True} 105 \| 0.552633 \| 0.508671 \| 1.086425214 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False} 106 \| 0.634108 \| 0.574629 \| 1.103508525 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True} 107 \| 0.534013 \| 0.510996 \| 1.045043405 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False} 108 \| 7.056642 \| 7.066717 \| 0.9985743026 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True} 109 \| 4.144275 \| 4.142658 \| 1.000390329 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False} 110 \| 7.172683 \| 7.189867 \| 0.9976099697 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True} 111 \| 4.162538 \| 4.158875 \| 1.000880767 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False} 112 \| 7.194233 \| 7.181837 \| 1.001726021 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True} 113 \| 4.294083 \| 4.196062 \| 1.023360236 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False} 114 \| 1.875692 \| 0.891071 \| 2.104986022 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True} 115 \| 1.097479 \| 0.781175 \| 1.404907991 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False} 116 \| 1.8883 \| 0.89015 \| 2.121327866 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 
'padding': 1, 'return_indices': True} 117 \| 1.101329 \| 0.778542 \| 1.414604479 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False} 118 \| 1.872833 \| 0.893654 \| 2.095702587 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True} 119 \| 1.096712 \| 0.784579 \| 1.397835017 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False} 120 \| 0.513029 \| 0.374417 \| 1.370207549 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True} 121 \| 0.349546 \| 0.305763 \| 1.143192603 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False} 122 \| 0.518929 \| 0.377487 \| 1.374693698 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True} 123 \| 0.364662 \| 0.3145 \| 1.159497615 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False} 124 \| 0.521275 \| 0.375242 \| 1.389170189 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True} 125 \| 0.367488 \| 0.308354 \| 1.191773092 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False} 126 \| 0.652342 \| 0.569308 \| 1.145850752 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True} 127 \| 0.555696 \| 0.506892 \| 1.096280865 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False} 128 \| 0.654333 \| 0.570367 \| 1.147213987 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True} 129 \| 0.548925 \| 0.505825 \| 1.085207335 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 
'return_indices': False} 130 \| 0.655908 \| 0.571904 \| 1.146884792 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True} 131 \| 0.560808 \| 0.508238 \| 1.103435792 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False} 132 \| 6.949462 \| 6.949112 \| 1.000050366 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True} 133 \| 4.072913 \| 4.065013 \| 1.001943413 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False} 134 \| 7.200896 \| 7.197792 \| 1.000431243 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True} 135 \| 4.291367 \| 4.218538 \| 1.017264038 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False} 136 \| 7.1823 \| 7.306933 \| 0.9829431856 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True} 137 \| 4.151175 \| 4.149592 \| 1.000381483 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False} 138 \| 1.781279 \| 0.884288 \| 2.014365229 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True} 139 \| 1.050804 \| 0.774362 \| 1.356993241 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False} 140 \| 1.860758 \| 0.884637 \| 2.103414169 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True} 141 \| 1.099908 \| 0.775887 \| 1.417613647 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False} 142 \| 1.857387 \| 0.885738 \| 2.096993693 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True} 
143 \| 1.105279 \| 0.77365 \| 1.428655077 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False} 144 \| 0.489408 \| 0.269583 \| 1.815426047 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True} 145 \| 0.322525 \| 0.236979 \| 1.360985573 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False} 146 \| 0.515475 \| 0.265813 \| 1.93923924 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True} 147 \| 0.315525 \| 0.228146 \| 1.382995976 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False} 148 \| 0.503438 \| 0.277204 \| 1.816128194 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True} 149 \| 0.335421 \| 0.228275 \| 1.469372467 \| \| (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False} 150 \| 5.72495 \| 4.909554 \| 1.166083518 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': True} 151 \| 4.45215 \| 4.251333 \| 1.047236243 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': False} 152 \| 29.953021 \| 29.879879 \| 1.002447868 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True} 153 \| 9.854683 \| 9.839517 \| 1.001541336 \| \| (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False} 154 \| 6.178033 \| 5.697375 \| 1.084364817 \| \| (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': True} 155 \| 6.280317 \| 5.712525 \| 1.099394226 \| \| (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': False} 156 \| 10.256062 \| 11.336527 \| 0.9046917103 \| \| (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 
50, 'return_indices': True} 157 \| 9.469546 \| 11.33705 \| 0.8352742556 \| \| (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': False} 158 \| 0.119087 \| 0.0797 \| 1.494190715 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True} 159 \| 0.098713 \| 0.047173 \| 2.092574142 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False} 160 \| 0.960812 \| 0.675762 \| 1.421820108 \| \| (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': True} 161 \| 0.536546 \| 0.485958 \| 1.104099531 \| \| (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': False} 162 \| 2.555225 \| 1.791567 \| 1.426251432 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True} 163 \| 1.419087 \| 1.305137 \| 1.087308842 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False} 164 \| 5.182008 \| 3.48085 \| 1.488719135 \| \| (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': True} 165 \| 2.831779 \| 2.498537 \| 1.133374851 \| \| (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': False} 166 \| 8.546038 \| 5.7783 \| 1.478988284 \| \| (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': True} 167 \| 4.731004 \| 4.161975 \| 1.136720908 \| \| (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': False} 168 \| 0.084754 \| 0.07435 \| 1.139932751 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True} 169 \| 0.057933 \| 0.043096 \| 1.344277891 \| \| (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False} 170 \| 2.568592 \| 1.802117 \| 1.425319222 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True} 171 \| 1.433054 \| 1.307342 \| 1.096158465 \| \| (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False} 172 \| 10.3213 \| 7.111604 \| 1.451332217 \| \| (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': True} 173 \| 5.680525 \| 5.168129 \| 1.099145358 \| \| (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': False} 174 \| 1.02255 \| 1.01375 \| 1.008680641 
\| \| (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': False} 175 \| 3.074233 \| 3.094383 \| 0.993488201 \| \| (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': True} 176 \| 1.016812 \| 1.030575 \| 0.9866453194 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False} 177 \| 3.053658 \| 3.089504 \| 0.9883974903 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True} 178 \| 1.025863 \| 1.032088 \| 0.9939685376 \| \| (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': False} 179 \| 3.798942 \| 3.799213 \| 0.9999286694 \| \| (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': True} 180 \| 4.492979 \| 4.493421 \| 0.999901634 \| \| (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': False} 181 \| 51.543363 \| 51.266204 \| 1.005406271 \| \| (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': True} 182 \| 1.018008 \| 1.001587 \| 1.016394981 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': False} 183 \| 3.035404 \| 3.003113 \| 1.010752509 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': True} 184 \| 0.610421 \| 0.56 \| 1.0900375 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': False} 185 \| 1.138983 \| 0.757296 \| 1.504012962 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': True} 186 \| 0.641558 \| 0.557808 \| 1.150141267 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': False} 187 \| 1.181475 \| 0.754725 \| 1.565437742 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': True} 188 \| 1.03045 \| 1.026904 \| 1.003453098 \| \| (10, 1000, 1000), {'kernel_size': 4, 
'padding': 1, 'stride': (1, 1), 'return_indices': False} 189 \| 3.041421 \| 3.0263 \| 1.00499653 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': True} 190 \| 0.609929 \| 0.572304 \| 1.065743032 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': False} 191 \| 1.146875 \| 0.756446 \| 1.516135983 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': True} 192 \| 0.645187 \| 0.561708 \| 1.148616363 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': False} 193 \| 1.181721 \| 0.758054 \| 1.558887625 \| \| (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': True} 194 \| 0.927654 \| 0.925946 \| 1.0018446 \| \| (10, 1000, 1000), {'kernel_size': 1, 'return_indices': False} 195 \| 2.749983 \| 2.740354 \| 1.00351378 \| \| (10, 1000, 1000), {'kernel_size': 1, 'return_indices': True} </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157876 Approved by: https://github.com/malfet	2025-08-08 16:40:10 +00:00
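As a quick check on the rows above, the third numeric column matches the first divided by the second. The sketch below assumes that the first two columns are timings before and after the change (the column headers fall outside this excerpt), so the quotient reads as a speedup. The function name `speedup` is hypothetical.

```python
def speedup(before, after):
    """Return before / after; values above 1.0 mean the second timing is lower."""
    if after <= 0:
        raise ValueError("timing must be positive")
    return before / after


# Row 184 above lists 0.610421 and 0.56 with a ratio of 1.0900375.
ratio = speedup(0.610421, 0.56)
print(f"{ratio:.7f}")
```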
Animesh Jain	a4f69a5da0	[dynamo][guards] Remove guards on stdlib modules (#159913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159913 Approved by: https://github.com/StrongerXi	2025-08-08 16:26:04 +00:00
Adam J. Stewart	231c72240d	CMake build: preserve PYTHONPATH (#160144 ) Fixes #160092 I'm very new to CMake, so let me know if there's a fancier way to do this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160144 Approved by: https://github.com/malfet Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>	2025-08-08 16:03:49 +00:00
gaoyvfeng	50f23ff6f8	rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883 ) Fixes #159399 "Modified torch.testing._internal.inductor_utils and test/inductor" Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883 Approved by: https://github.com/janeyx99	2025-08-08 15:44:52 +00:00
zpcore	8a37f0c903	improve gather and scatter_add strategy (#160140 ) As title. This PR made a small fix on top of https://github.com/meta-pytorch/autoparallel/pull/81. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160140 Approved by: https://github.com/fmassa	2025-08-08 15:06:24 +00:00
Edward Z. Yang	b5fd7223b1	Improve pin_memory error message on CPU-only systems (#159994 ) ## Summary - clarify pin_memory error message when no accelerator backend is available ## Testing - `python repro_pin_memory.py` (fails: Need to provide pin_memory allocator to use pin memory) - `lintrunner -a` ------ https://chatgpt.com/codex/tasks/task_e_6893ba92c93483238a9bdfdd6c52812b Pull Request resolved: https://github.com/pytorch/pytorch/pull/159994 Approved by: https://github.com/albanD	2025-08-08 14:36:45 +00:00
Edward Yang	9fa8ce26cf	Working setup with runnable PyTorch on Codex. (#159968 ) Sample transcript: https://chatgpt.com/s/cd_68938effc1a88191ae78bc82a8cefe94 This makes use of https://github.com/pytorch/pytorch/pull/159965 to bypass doing an actual build and use nightly. Things to improve: - Once USE_NIGHTLY is in main can remove the patching - We should just keep using the latest nightly, instead of a hard coded one Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159968 Approved by: https://github.com/wdvr	2025-08-08 14:34:15 +00:00
David Berard	62bac07981	[inductor][triton] support profile_scratch launcher arg (#159772 ) This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static CUDA launcher and the AOTI kernel callers to pass in this argument when calling the Triton kernel. Tests: https://github.com/pytorch/pytorch/pull/159158. I also verified these tests locally with Triton 3.2, 3.3, and 3.4. Fixes: * static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`) * AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`) Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772 Approved by: https://github.com/NikhilAPatel, https://github.com/eellison	2025-08-08 14:27:38 +00:00
Isalia20	7f4cb4a3e0	[MPS] coalesce for sparse tensors (#159729 ) MPS coalesce function for sparse tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-08 13:49:55 +00:00
Aidyn-A	556e2a73f4	[Test][Easy] Use float16 dtype in test_sort_large (#159939 ) The test fails with: >RuntimeError: var_mean only support floating point and complex dtypes Pull Request resolved: https://github.com/pytorch/pytorch/pull/159939 Approved by: https://github.com/eqy	2025-08-08 09:56:44 +00:00
Xuehai Pan	178515d0ff	[BE][PYFMT] remove `black`: finish `black -> ruff format` migration (#144557 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144557 Approved by: https://github.com/ezyang	2025-08-08 07:46:10 +00:00
codingwithsurya	3a56237440	[SymmMem] Send tensors with unerased type information to NVSHMEM Triton kernels (#159788 ) This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions so that users can pass tensors (rather than pointers) as inputs to their NVSHMEM Triton kernels. The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful if you want to do, for instance, some local math on the data. ----- TODO: This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`. From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer:
```
Pointer-Based Version:
Rank 0 → Rank 1:
  Local buffer: 0x430300a00 (src)
  Remote buffer: 0x2430300c00 (dst) ← Rank 1's memory
  Remote signal: 0x2430301600 (sig) ← Rank 1's signal
Rank 1 (waiting):
  Local signal: 0x430301600 (waits here)

Tensor-Based Version:
Rank 0 → Rank 1:
  Local buffer: 0x430300a00 (src)
  Local buffer: 0x430300c00 (dst) ← this is wrong
  Local signal: 0x430300e00 (sig) ← this is wrong
Rank 1 (waiting):
  Local signal: 0x430300e00 (waits here)
```
Next Steps: Need a mechanism to resolve local tensor → remote PE address, equivalent to the handle.buffer_ptrs[peer] lookup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159788 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756	2025-08-08 05:20:42 +00:00
codingwithsurya	e0d8a315c5	[SymmMem] Add helpful docstrings for all NVSHMEM APIs (#159756 ) Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159756 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755	2025-08-08 05:20:42 +00:00
codingwithsurya	bfff2e3592	[SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch (#159755 ) This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64). It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. nvshmem_float_sum_reduce or nvshmem_bfloat16_prod_reduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/159755 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734	2025-08-08 05:20:36 +00:00
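The dtype-based dispatch described in the entry above can be sketched in plain Python. This is only an illustration, not the PR's code: `reduce_function_name` and the lookup table are hypothetical, and only the `float` and `bfloat16` typenames are confirmed by the examples in the commit message. The remaining table entries are assumptions.

```python
# Hypothetical dtype-name -> NVSHMEM typename table. Only "float" and
# "bfloat16" are confirmed by the commit message examples; the other
# entries are assumptions made for illustration.
NVSHMEM_TYPENAMES = {
    "float16": "half",
    "bfloat16": "bfloat16",
    "float32": "float",
    "float64": "double",
}
for bits in (8, 16, 32, 64):
    NVSHMEM_TYPENAMES[f"int{bits}"] = f"int{bits}"
    NVSHMEM_TYPENAMES[f"uint{bits}"] = f"uint{bits}"

SUPPORTED_OPS = ("sum", "max", "min", "prod")


def reduce_function_name(dtype, operation):
    """Build an extern name such as nvshmem_float_sum_reduce."""
    # str(torch.float32) is "torch.float32", so drop any module prefix.
    name = str(dtype).rsplit(".", 1)[-1]
    if name not in NVSHMEM_TYPENAMES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if operation not in SUPPORTED_OPS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return f"nvshmem_{NVSHMEM_TYPENAMES[name]}_{operation}_reduce"


print(reduce_function_name("torch.float32", "sum"))
print(reduce_function_name("bfloat16", "prod"))
```

Keeping the normalization in one helper means the kernel-facing wrapper needs a single entry point, with the name assembled from the dtype and the operation.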
codingwithsurya	1c881440f4	[SymmMem] Initialize NVSHMEM module only for kernels that have nvshmem in their name (#159734 ) Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change conditionally initializes `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names for all of our nvshmem kernels to align with this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159734 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215, #159701	2025-08-08 05:20:29 +00:00
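A minimal sketch of the name-based filtering described in the entry above, with the initializer passed in so that the example runs without Triton or NVSHMEM. The hook signature, `make_post_compile_hook`, and the stand-in kernel objects are all hypothetical. The real hook receives whatever Triton passes to its post-compile hook.

```python
from types import SimpleNamespace


def make_post_compile_hook(init_module):
    """Return a hook that calls init_module only for NVSHMEM kernels."""

    def hook(kernel_name, kernel):
        if "nvshmem" not in kernel_name:
            return False  # ordinary Triton kernel: skip the extra init
        init_module(kernel.module)
        return True

    return hook


# Stand-in for _nvshmemx_cumodule_init: it only records its argument.
initialized = []
hook = make_post_compile_hook(initialized.append)
hook("nvshmem_put_kernel", SimpleNamespace(module="mod_a"))
hook("add_kernel", SimpleNamespace(module="mod_b"))
print(initialized)
```

The filter relies purely on a naming convention, which is why the same commit renames the NVSHMEM kernels to include "nvshmem".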
codingwithsurya	7c4f7b9340	[SymmMem] Add Triton 3.4 support to NVSHMEM Triton and fix CI tests (make device library discoverable + fix peer calculation bug) (#159701 ) This PR introduces support for Triton 3.4 and resolves several CI and test-related issues. Triton 3.4 Compatibility - The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook. - The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes. Fix CI Errors - The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped. - Added a decorator to run NVSHMEM tests only on H100s (compatible hardware) Peer Rank Calculation Fix - The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank. Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159701 Approved by: https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136, #159215	2025-08-08 05:20:22 +00:00
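The peer-rank fix in the entry above is easy to check by hand. The helper names below are hypothetical. The two formulas are the ones quoted in the commit message.

```python
def peer_old(rank, world_size):
    """Previous formula: mirror the rank across the whole world."""
    return (world_size - 1) - rank


def peer_new(rank):
    """New formula: pair rank 0 with rank 1.

    Only meaningful for ranks 0 and 1, the pair the tests exercise.
    """
    return 1 - rank


for world_size in (2, 8):
    print(world_size, [peer_old(r, world_size) for r in (0, 1)],
          [peer_new(r) for r in (0, 1)])
```

With two ranks both formulas agree. With eight ranks the old formula pairs rank 0 with rank 7, which is the mismatch the commit describes.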
codingwithsurya	1783d6e966	[SymmMem] Fix flaky wait_until test (#159215 ) When playing around with it, I noticed some flakiness in this test across sessions. After debugging, it turns out that the heavy sync primitives I was calling from inside Triton kernels (like `nvshmem_quiet()` or `nvshmem_fence()`) were causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock. The fix was realizing that `wait_until` already provides all the sync you need. Just do: - PE A: `nvshmem_wait_until(&ivar, ...)` - PE B: `nvshmem_put(&ivar_on_PE_A, ...)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159215 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718, #159136	2025-08-08 05:20:16 +00:00
codingwithsurya	ea7fe0ecf6	[SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity (#159136 ) Quick refactor for consistency and clarity. 1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms. 2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals. To clean this up: • All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit. • Typed APIs consistently use nelems for element counts. • Docstrings were added or updated to clarify expected units. Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping. This should make the API more intuitive and reduce friction for developers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159136 Approved by: https://github.com/mandroid6, https://github.com/ngimel ghstack dependencies: #158515, #158718	2025-08-08 05:20:09 +00:00
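The unit distinction in the entry above (`nelems` for typed APIs, `nbytes` for byte-based ones) comes down to one multiplication. A small sketch, with a hypothetical helper name and a hand-written table of standard element widths:

```python
# Element sizes in bytes for a few dtypes (standard widths).
ITEMSIZE = {"int8": 1, "float16": 2, "float32": 4, "int64": 8}


def nbytes_for(nelems, dtype_name):
    """Convert an element count into the byte count a byte-based API expects."""
    if nelems < 0:
        raise ValueError("nelems must be non-negative")
    return nelems * ITEMSIZE[dtype_name]


print(nbytes_for(3, "int64"))
print(nbytes_for(1024, "float16"))
```

Passing an element count where a byte count is expected would transfer only a fraction of the buffer for any dtype wider than one byte, which is the class of bug the renamed parameters are meant to prevent.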
codingwithsurya	b0b229b197	[SymmMem] Use _get_default_group() instead of group.WORLD for group_name access (#158718 ) Both approaches functionally return the default process group created by `init_process_group()` but `_get_default_group()` is a dedicated function with [better error handling and type safety](`4869f71170/torch/distributed/distributed_c10d.py (L1300-L1310)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/158718 Approved by: https://github.com/Skylion007, https://github.com/fduwjj ghstack dependencies: #158515	2025-08-08 05:20:02 +00:00
codingwithsurya	b5c937259b	[SymmMem] Add NVSHMEM Reduction support (sum, min, max) into Triton (#158515 ) Implements sum_reduce, min_reduce, and max_reduce collective operations for NVSHMEM Triton kernels. Enables parallel reduction computations across PE teams for int64 data types. Tests: `python test/distributed/test_nvshmem_triton.py` <details> <summary> Quick debug print for sanity check </summary> ```markdown ============================================================ [Rank 1] Starting min/max reduction test with world_size=2 ============================================================ ============================================================ [Rank 0] Starting min/max reduction test with world_size=2 ============================================================ [Rank 0] Source data for min/max: [10, 20] [Rank 1] Source data for min/max: [15, 5] [Rank 1] All values across PEs: [Rank 0] All values across PEs: - Position 0: [10, 15] - Position 0: [10, 15] - Position 1: [20, 5] - Position 1: [20, 5] [Rank 1] Expected min: [10, 5] [Rank 0] Expected min: [10, 5] [Rank 1] Expected max: [15, 20] [Rank 0] Expected max: [15, 20] [Rank 0] Executing MIN reduction... [Rank 1] Executing MIN reduction... [Rank 0] Executing MAX reduction... [Rank 1] Executing MAX reduction... /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. 
warnings.warn( # warn only once [Rank 1] Results: [Rank 0] Results: [Rank 1] MIN reduction result: [10, 5] [Rank 1] MAX reduction result: [15, 20] [Rank 0] MIN reduction result: [10, 5] [Rank 0] MAX reduction result: [15, 20] [Rank 1] ============================================================ [Rank 1] Min/Max reduction test PASSED ✓ [Rank 1] ============================================================ [Rank 0] ============================================================ [Rank 0] Min/Max reduction test PASSED ✓ [Rank 0] ============================================================ ...... ============================================================ ============================================================ [Rank 0] Starting sum reduction test with world_size=2 [Rank 1] Starting sum reduction test with world_size=2 ============================================================ ============================================================ [Rank 0] Configuration: [Rank 1] Configuration: - nreduce: 3 (number of separate reductions) - nreduce: 3 (number of separate reductions) - dtype: torch.int64 - dtype: torch.int64 [Rank 1] Source data: [2, 4, 6] [Rank 1] Contribution explanation: [Rank 0] Source data: [1, 2, 3] [Rank 0] Contribution explanation: - Element 0: 2 = (rank=1+1) * (index=0+1) - Element 0: 1 = (rank=0+1) * (index=0+1) - Element 1: 4 = (rank=1+1) * (index=1+1) - Element 1: 2 = (rank=0+1) * (index=1+1) - Element 2: 6 = (rank=1+1) * (index=2+1) - Element 2: 3 = (rank=0+1) * (index=2+1) [Rank 1] Initial destination: [-1, -1, -1] [Rank 0] Initial destination: [-1, -1, -1] [Rank 0] Expected results after reduction: [3, 6, 9] [Rank 1] Expected results after reduction: [3, 6, 9] [Rank 0] Executing sum reduction... [Rank 1] Executing sum reduction... [Rank 1] Sum reduction completed /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. 
Using the current device set by the user. warnings.warn( # warn only once [Rank 0] Sum reduction completed /data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [Rank 0] Results after reduction: [Rank 0] Destination buffer: [3, 6, 9] [Rank 1] Results after reduction: [Rank 0] Verification: - Reduction 0: PE0: 1 + PE1: 2 = 3 Result: 3, Match: ✓ - Reduction 1: PE0: 2 + PE1: 4 = 6 Result: 6, Match: ✓ [Rank 1] Destination buffer: [3, 6, 9] - Reduction 2: PE0: 3 + PE1: 6 = 9 [Rank 1] Verification: - Reduction 0: PE0: 1 + PE1: 2 = 3 Result: 9, Match: ✓ Result: 3, Match: ✓ - Reduction 1: PE0: 2 + PE1: 4 = 6 Result: 6, Match: ✓ - Reduction 2: PE0: 3 + PE1: 6 = 9 Result: 9, Match: ✓ [Rank 0] ============================================================ [Rank 0] Sum reduction test PASSED ✓ [Rank 0] All 3 reductions computed correctly across 2 PEs [Rank 0] ============================================================ [Rank 1] ============================================================ [Rank 1] Sum reduction test PASSED ✓ [Rank 1] All 3 reductions computed correctly across 2 PEs [Rank 1] ============================================================ ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158515 Approved by: https://github.com/mandroid6, https://github.com/ngimel	2025-08-08 05:19:55 +00:00
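The expected values in the debug output above can be checked with plain Python. This is only a sketch of the arithmetic the test verifies; it does not use NVSHMEM or Triton, and the helper name `reduce_across_pes` is hypothetical.

```python
# Plain-Python model of the reductions exercised by the test output above.
# No NVSHMEM or Triton is involved; reduce_across_pes is a hypothetical helper.

def reduce_across_pes(per_pe_data, op):
    """Apply `op` to the values held at the same position on every PE."""
    return [op(values) for values in zip(*per_pe_data)]

world_size = 2
nreduce = 3

# Sum test: element i on rank r is (r + 1) * (i + 1), as described in the log.
sum_src = [[(rank + 1) * (i + 1) for i in range(nreduce)]
           for rank in range(world_size)]
sum_result = reduce_across_pes(sum_src, sum)

# Min/max test: source data copied from the log.
minmax_src = [[10, 20], [15, 5]]
min_result = reduce_across_pes(minmax_src, min)
max_result = reduce_across_pes(minmax_src, max)

print(sum_result, min_result, max_result)
```

Each PE ends up with the same destination buffer, which is why both ranks report identical results in the log.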
PyTorch UpdateBot	24257f5bfa	[vllm hash update] update the pinned vllm hash (#159822 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159822 Approved by: https://github.com/pytorchbot	2025-08-08 04:13:48 +00:00
Yiming Zhou	017259f9c6	[benchmarks] Add nativert benchmark (#159922 ) Add NativeRT as an option in the PT2 OSS benchmark ``` python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922 Approved by: https://github.com/angelayi	2025-08-08 03:38:32 +00:00
xinan.lin	2ea40fba84	[Linter] Improve device-bias linter by adding detection for `with torch.device("cuda")`. (#159926 ) ``` For example, detect the following situation: >>>Lint for test/dynamo/test_modes.py: Error (TEST_DEVICE_BIAS) [device-bias] `@requires_gpu` function should not hardcode `with torch.device('cuda')`, suggest to use torch.device(GPU_TYPE) 687 \| flex_attention as flex_attention_eager, 688 \| ) 689 \| >>> 690 \| with torch.device("cuda"): 691 \| flex_attention = torch.compile(flex_attention_eager, dynamic=False) 692 \| 693 \| with self.assertRaisesRegex( ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159926 Approved by: https://github.com/EikanWang, https://github.com/jansel ghstack dependencies: #159759	2025-08-08 03:20:42 +00:00
Aaron Gokaslan	beb4d7816d	[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 ) Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107 Approved by: https://github.com/albanD	2025-08-08 03:14:59 +00:00
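A minimal illustration of the pattern this rule targets, assuming only one piece of the split result is needed. The sample string is made up for the example.

```python
# Illustration of the rewrite described above; the input string is arbitrary.
qualified = "torch.nn.functional.relu"

# Before: every "." is split even though only one piece is used.
first_before = qualified.split(".")[0]
last_before = qualified.split(".")[-1]

# After: stop at the first separator from the relevant end.
first_after = qualified.split(".", maxsplit=1)[0]
last_after = qualified.rsplit(".", maxsplit=1)[-1]

print(first_after, last_after)
```

The results are identical; the `maxsplit=1` form simply avoids scanning and allocating the pieces that are thrown away.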
Guilherme Leobas	3fcd79e023	Fix infinite loop when iterating over an empty zip (#159673 ) Dynamo would enter in an infinite recursion when `ZipVariable.next_variable(tx)` was called and there was no iterable to be iterated Pull Request resolved: https://github.com/pytorch/pytorch/pull/159673 Approved by: https://github.com/williamwen42	2025-08-08 02:50:21 +00:00
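For reference, this is the eager CPython behaviour that traced code is expected to match: a `zip` with no iterables is exhausted immediately. The sketch below does not exercise Dynamo itself.

```python
# Eager-mode behaviour of an empty zip; no torch.compile involved.
sentinel = object()

empty = zip()
first = next(empty, sentinel)  # nothing to yield, so the default is returned

as_list = list(zip())

# A zip over an empty iterable and a non-empty one is also empty.
mixed = list(zip([], [1, 2, 3]))

print(first is sentinel, as_list, mixed)
```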
bobrenjc93	05c417715f	integrate kernacle into inductor (#160121 ) This adds integration into inductor in two parts 1) It kicks off the best config lookup at lowering time within mm.py 2) It awaits the future at scheduling time in select_algorithm.py Notably this does not do the following 1) Support for enumerating between mm, addmm and bmm 2) Support for enumerating between exhaustive/max 3) Enumerating different hardware SKUs eg. H100, A100, etc. those will come in the next diffs Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121 Approved by: https://github.com/izaitsevfb	2025-08-08 02:14:44 +00:00
Georgia Phillips	ba4ccf5d67	Turn on execution frame cleanup by default (#160110 ) Summary: Turning execution frame cleanup back on since D78621408 is done Test Plan: See D78621408 Rollback Plan: Differential Revision: D79730674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160110 Approved by: https://github.com/jingsh	2025-08-08 02:13:48 +00:00
Wenyuan Chi	d68c323692	Log max_autotune exceptions (#159687 ) (#159688 ) Summary: Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures. Currently, exceptions are dumped to the console in the following format: ``` [0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help. [0/0] Runtime error during autotuning: [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.. [0/0] Ignoring this choice. ``` The exception tracebacks: ``` # inner exception traceback: File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers launchers.append(result.make_launcher()) ^^^^^^^^^^^^^^^^^^^^^^ File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher self.kernel.load_kernel(device) File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel( # wrapped exception traceback: File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout choice.precompile() File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile self.bmreq.precompile() File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile getattr(mod, self.kernel_name).precompile() File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile self._make_launchers() File 
"<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") ``` With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event. The format: ``` { "exceptions": [ { "choice_type": "triton", "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0", "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.", "exception": "OutOfMemoryError", "required_memory": "262144", "hardware_limit": "232448" } ] } ``` Test Plan: buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt Rollback Plan: Differential Revision: D79420953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688 Approved by: https://github.com/stashuk-olek	2025-08-08 01:30:08 +00:00
Edward Z. Yang	03b254e49f	Extend torch function support to ALL arguments, not just scalar type (but not insides of list) (#145089 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/145089 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-08-07 23:43:53 +00:00
PyTorch MergeBot	195b5c2e27	Revert "dynamo: Remove passing or deleted dynamo_expected_failures (#159691 )" This reverts commit 36f46d082a4954921cb8493223f000f2aab79ed7. Reverted https://github.com/pytorch/pytorch/pull/159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/159691#issuecomment-3166067241))	2025-08-07 22:55:51 +00:00
Anshul Sinha	f077c2402e	[replicate][be] improved readability of test case description (#160128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160128 Approved by: https://github.com/mori360	2025-08-07 22:51:58 +00:00
Patrick C. Toulme	d46768db04	[MTIA] Allow users who know what they are doing to ignore all device mismatches in tracing and take a preferred device. (#159931 ) Summary: Device mismatches in tracing can most often be ignored. These are only logical mismatches not physical. Take any intermediate computation, and that computation will not actually materialize in a compiled binary execution. So a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on CPU device during the execution, as they are temporary allocations. If a user knows his tensors at graph input are all on the correct device, then he can ignore all tracing errors. Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing. Users can set ``` torch._functorch.config.fake_tensor_prefer_device_type = 'mtia' ``` to forcefully override any mismatch and prefer the non cpu device. This unblocks vLLM graph mode for MTIA. Test Plan: Added two unit tests. Rollback Plan: Differential Revision: D79698438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931 Approved by: https://github.com/jansel	2025-08-07 22:37:15 +00:00
clr	36f46d082a	dynamo: Remove passing or deleted dynamo_expected_failures (#159691 ) partially generated with ``` for TESTCASE in $(ls \| cut -f1 -d'.' \| grep -v CPython \| uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else sl rm "$TESTCASE"* ; fi; done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159691 Approved by: https://github.com/xmfan	2025-08-07 21:41:50 +00:00
Sherlock Huang	8147370733	Fix qembeddingbag_byte_prepack_meta to use sym_sizes (#159985 ) Summary: In qembeddingbag_byte_prepack_meta, weight.sizes() would return a concrete int. we should use .sym_size() to return a SymInt instead. Test Plan: CI Rollback Plan: Reviewed By: kqfu, henryoier Differential Revision: D79744512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159985 Approved by: https://github.com/jerryzh168, https://github.com/henryoier	2025-08-07 21:22:29 +00:00
Angela Yi	e619c6bb90	[export] Apply move_to_device_pass to all submodules (#159992 ) Previously we only applied this move_to_device_pass to the toplevel graph. However if we have HOO, this pass will not be applied on the HOO submodules. This PR modifies the pass to run on all submodules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992 Approved by: https://github.com/yiming0416	2025-08-07 18:51:15 +00:00
Will Constable	3cf7b4024e	[DTensor] Support user-supplied Generator for random ops (#159933 ) If the user provides a generator kwarg to a random op (e.g. nn.init.uniform_(..., generator=my_generator)), we can still advance that generator's state in a SPMD-global way so that each local-tensor gets appropriate values and the generator advances to the same state as if it had operated on the full tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159933 Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol	2025-08-07 18:47:22 +00:00
Xu Han	21392c0e06	[inductor] disable flex decoding on Windows. (#160072 ) Discussed with @jianan-gu and @Valentine233 , disable flex decoding on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160072 Approved by: https://github.com/angelayi	2025-08-07 18:07:36 +00:00
Aleksei Nikiforov	ee1fb43450	Fix docker image creation (#158634 ) Since switching from wheel 0.34.2 to wheel 0.45.1 python symlinks are no longer correctly created. Migrate to packaging package for symlink creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/158634 Approved by: https://github.com/malfet	2025-08-07 17:41:47 +00:00
Aidyn-A	0bd3af4fb8	Further fix failing tests in test/inductor/test_analysis.py (#160070 ) This is a follow up on #159800 as other tests are still failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160070 Approved by: https://github.com/aorenste	2025-08-07 17:32:58 +00:00
Ankita George	8399cf88ce	Use only safetensors APIs in HFStorageReader (#159681 ) Get rid of the logic to read the metadata from the header of the safetensors file manually and use the functions as part of safe_open() to get the metadata. This is much cleaner and allows us to not rely on our own custom methods to get metadata, but use safetensors provided APIs Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159681 Approved by: https://github.com/saumishr ghstack dependencies: #159405, #159406	2025-08-07 17:23:03 +00:00
Ankita George	0b187b3114	DCP HF reader: use safe_open instead of reading the bytes (#159406 ) Reading the bytes and converting to tensors is much slower than using safe_open. For a 8B model across 8 ranks, took ~30s to load before this change and ~4s after. Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159406 Approved by: https://github.com/saumishr ghstack dependencies: #159405	2025-08-07 17:23:03 +00:00
Ankita George	69cc606fda	HF component update to not use fsspec components (#159405 ) Update HF components to use the filesystem writer/reader instead of inheriting from fsspec components. There does not seem to be much need for fsspec, since users are using mounted storage. Using local storage allows performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s to load an 8B model), a significant win over reading bytes and converting them to tensors, which is what we do now. We can also use the official methods provided by HF instead of reading the metadata from bytes and loading it ourselves. Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405 Approved by: https://github.com/saumishr	2025-08-07 17:22:54 +00:00
Markus Hoehnerbach	57f738b635	[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983 Approved by: https://github.com/eellison ghstack dependencies: #158758	2025-08-07 17:07:26 +00:00
Markus Hoehnerbach	e167c7d0f3	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-07 17:07:26 +00:00
Shivam Raikundalia	b1a602762e	[Profiler] Update README (#159816 ) Summary: Updated README with code structure and explanation of core features within profiler Test Plan: N/A Rollback Plan: Differential Revision: D79604189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159816 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi	2025-08-07 16:44:41 +00:00
Han, Xu	e1cf0d496e	[inductor] unification for inductor debug. (#159998 ) Unify the inductor debug build, following @desertfire's suggestion: https://github.com/pytorch/pytorch/pull/159938#pullrequestreview-3093803196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159998 Approved by: https://github.com/angelayi	2025-08-07 16:38:00 +00:00
Xu Han	06824f3c72	[inductor] fix test_dynamo_timed on Windows. (#159981 ) Fixed `test_dynamo_timed`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159981 Approved by: https://github.com/angelayi	2025-08-07 16:37:52 +00:00
PyTorch MergeBot	f3a4d742ec	Revert "Add DeviceAllocator as the base device allocator (#138222 )" This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e. Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
PyTorch MergeBot	74da2604c9	Revert "Add unified memory APIs for torch.accelerator (#152932 )" This reverts commit 15f1173e5d72d6d45faba4cecd135e0160f06c6f. Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
PyTorch MergeBot	c4e64467b5	Revert "Add UT for torch.accelerator memory-related API (#155200 )" This reverts commit 4604f0482c2b4a3001b62e5bc5085149a9bb053c. Reverted https://github.com/pytorch/pytorch/pull/155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
Zain Rizvi	90b78ee50f	Move xla jobs to unstable workflow (#159272 ) Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily. If you want to run these xla tests, add the ciflow/unstable label to your PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/159272 Approved by: https://github.com/atalman, https://github.com/malfet	2025-08-07 16:22:52 +00:00
Xilun Wu	e248719ac0	[DTensor] support _StridedShard in view op (#159656 ) Summary Some thoughts on view-op and `_StridedShard` interaction: 1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned) compared to `Shard`. It only changes how shards permute across the devices. 2. `view()` op on DTensor strictly forbids shard redistribution which means if `view()` may cause shard permutation across devices, it should be rejected. This is enforced in today's sharding prop for `view()`. 3. Since DTensor `view()` won't introduce any redistribution, it's certain that `placements` won't change except the inner `dim` attribute of `Shard` or `_StridedShard`. Therefore, to support `_StridedShard` in `view()` op, the only change required is to keep `_StridedShard` as `_StridedShard` in the output spec. Test `pytest test/distributed/tensor/test_view_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656 Approved by: https://github.com/wconstab	2025-08-07 15:59:25 +00:00
Aleksei Nikiforov	f60454cce8	S390X: update test dependencies (#158636 ) numba currently doesn't build from source due to https://github.com/numba/numba/pull/10073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158636 Approved by: https://github.com/malfet	2025-08-07 15:58:30 +00:00
rzou	8ab5868a21	Actually run the einops tests in CI (#159776 ) The test filter was wrong, it should not start with "test/". Test Plan: - wait for CI - Tested locally with `python test/run_test.py --einops --verbose` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159776 Approved by: https://github.com/atalman, https://github.com/StrongerXi	2025-08-07 15:23:06 +00:00
Wang, Chuanqi	d20c4c20e6	[CI] Update xpu ci use rolling driver for new features (#158340 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158340 Approved by: https://github.com/seemethere Co-authored-by: xinan.lin <xinan.lin@intel.com>	2025-08-07 15:18:51 +00:00
Zhengxu Chen	83875cdb55	[nativert] Expose ModelRunner to public through pimpl type ModelRunnerHandle. (#159989 ) Summary: Today, users outside of PyTorch core cannot `#include <torch/nativert/ModelRunner.h>`. It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header there would pollute the namespace a lot, which is not what we want in general. Therefore we create a Handle type that holds a pointer, decoupling the actual type from the header definition. Test Plan: CI Rollback Plan: Differential Revision: D79751098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989 Approved by: https://github.com/dolpm	2025-08-07 14:23:21 +00:00
PyTorch MergeBot	a53d14d5f8	Revert "unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786 )" This reverts commit 3a2c3c8ed365eb4e4cf4620c25d70b2f70483762. Reverted https://github.com/pytorch/pytorch/pull/157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/157786#issuecomment-3164126250))	2025-08-07 13:09:33 +00:00
Dev Sashidhar	8cb91e20bc	Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908 ) This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908 Approved by: https://github.com/janeyx99, https://github.com/guangyey	2025-08-07 11:24:44 +00:00
Huy Do	b0df7715e8	Remove benchmark dependencies from regular ROCm CI images (#160047 ) Instead, use a new `pytorch-linux-jammy-rocm-n-py3-benchmarks` image for Docker benchmark job. This addresses 2 issues: * The current ROCm failures in trunk w.r.t librosa version https://github.com/pytorch/pytorch/actions/runs/16789466749/job/47549950994 that TorchBench pulls in. * Reduce the size of the regular ROCm CI images by removing TorchBench models, which is needed only for benchmarking jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160047 Approved by: https://github.com/malfet, https://github.com/izaitsevfb	2025-08-07 09:26:58 +00:00
Avik Chaudhuri	422bd6808b	dataclass pytree fix (#159916 ) Differential Revision: D79687243 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159916 Approved by: https://github.com/XuehaiPan, https://github.com/angelayi	2025-08-07 08:22:41 +00:00
thenumberouscode	24f43d0da7	[inductor] [cpu] fix the dtype hardcoded to int64 in store_reduction (#157904 ) ## Fixes https://github.com/pytorch/pytorch/issues/157683 ## mini repro * Just copy the code from the issue to reproduce it. ```python import torch device = "cpu" # Input tensors v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device) v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device) def my_model(v2_0, v3_0): v6_0 = -v3_0 v4_0 = v2_0 * v3_0 v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1) v0_0 = v2_0.to(torch.int32) v5_0 = v0_0.amax(dim=0) return v6_0, v4_0, v1_0, v0_0, v5_0 v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0) print("v6_0", v6_0.shape) print("v4_0", v4_0.shape) compiled_model = torch.compile(my_model, backend="inductor") v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0) print("v6_0", v6_0.shape) print("v4_0", v4_0.shape) print("v1_0", v1_0.shape) print("v0_0", v0_0.shape) print("v5_0", v5_0.shape) ``` error_stack ``` /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’ 41 \| convert(const Vectorized<src_t>& src) { \| ^~~~~~~ /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed: /tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2) 37 \| auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` ## summary The C++ kernel generated by Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail. 
The original code that caused the problem. ``` def my_model(v2_0, v3_0): v6_0 = -v3_0 v4_0 = v2_0 * v3_0 v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1) v0_0 = v2_0.to(torch.int32) # The reduction below is the code that caused the problem. v5_0 = v0_0.amax(dim=0) ``` ## proof procedure The C++ kernel generated by Inductor: ```c++ #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const int32_t* in_ptr0, int32_t* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L)) { { int32_t tmp_acc0_arr[16]; for (int i = 0; i < 16; i++) { tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min(); } int32_t tmp_acc0 = std::numeric_limits<int32_t>::min(); at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min()); for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L))) { auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16)); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L))) { for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)]; tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0); } } } } if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L))) { // Impossible data type conversion that causes the g++ compilation to fail. 
auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L))) { for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++) { out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)]; } } } } } } ``` The compiler complains: ```text /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’ 41 \| convert(const Vectorized<src_t>& src) { \| ^~~~~~~ /home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: template argument deduction/substitution failed: /tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2) 37 \| auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` So the following line is the problem: ```c++ // This line assumes that tmp_acc0_vec is a Vectorized<int64_t> and converts it to Vectorized<int32_t>. auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec); ``` The issue is that tmp_acc0_vec is of type Vectorized<int32_t>, but the template parameters expect a Vectorized<int64_t> to convert to Vectorized<int32_t>. The two conflict: no conversion should be emitted, because tmp_acc0_vec is already a Vectorized<int32_t>. The following line hardcodes the output variable type to int64, which causes unnecessary and incorrect type conversions. 
`d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904 Approved by: https://github.com/jgong5	2025-08-07 08:03:05 +00:00
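The point of the summary above (skip the conversion when the accumulator already has the output element type) can be sketched as a small decision function. This is a hypothetical model written for illustration only: `emit_reduction_store` is not Inductor's codegen and does not reproduce the actual change in `cpp.py`.

```python
# Hypothetical model of the decision described above; NOT Inductor's cpp.py.
def emit_reduction_store(acc_var: str, acc_dtype: str, out_dtype: str) -> str:
    """Return the C++ expression whose value should be stored to the output."""
    if acc_dtype == out_dtype:
        # Same element type: store the accumulator directly, no convert call.
        return acc_var
    # Different element types: use the two-argument convert<dst_t, src_t> form
    # named in the compiler note. Whether such a conversion exists for a given
    # vector width is not checked in this sketch.
    return f"at::vec::convert<{out_dtype}, {acc_dtype}>({acc_var})"

same = emit_reduction_store("tmp_acc0_vec", "int32_t", "int32_t")
different = emit_reduction_store("tmp_acc0_vec", "int64_t", "int32_t")
print(same)
print(different)
```

Deriving the source type from the accumulator rather than hardcoding it keeps the int32 reduction in the repro on the first branch.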
Sherlock Huang	aa75e917bd	[Export Schema] Remove deviceAllocationMap field (#159653 ) Summary: This field is not used today, and it's not useful either. The device allocation is configured at model loading time, specified by user. It shouldn't be part of the model definition. Test Plan: CI Rollback Plan: Differential Revision: D79385513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653 Approved by: https://github.com/zhxchen17	2025-08-07 07:31:42 +00:00
PyTorch UpdateBot	3f1636ebef	[audio hash update] update the pinned audio hash (#160046 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160046 Approved by: https://github.com/pytorchbot	2025-08-07 04:16:35 +00:00
IlyasMoutawwakil	c859ba7114	Make onnx export SDPA match aten behavior (#159973 ) This PR makes onnx sdpa export match the behavior of aten sdpa when boolean mask is used. @justinchuby ```python import onnxruntime as ort import torch class ScaledDotProductAttention(torch.nn.Module): def forward(self, query, key, value, attn_mask): return torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask) model = ScaledDotProductAttention() attn_mask = torch.ones(2, 4, 8, 8).bool() # boolean mask for attention attn_mask[0, 0, 0, :] = False # masking an entire row (padding token) query = key = value = torch.randn(2, 4, 8, 16) output = model(query, key, value, attn_mask) torch.onnx.export( model, (query, key, value, attn_mask), "scaled_dot_product_attention.onnx", input_names=["query", "key", "value", "attn_mask"], output_names=["output"], dynamo=False, # or True ) ort_session = ort.InferenceSession("scaled_dot_product_attention.onnx") np_inputs = {"query": query.numpy(), "key": key.numpy(), "value": value.numpy(), "attn_mask": attn_mask.numpy()} onnx_outputs = ort_session.run(None, np_inputs)[0] torch.testing.assert_close(output, torch.tensor(onnx_outputs), equal_nan=True) ``` fails the assertion because the ort model outputs nans. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159973 Approved by: https://github.com/xadupre, https://github.com/titaiwangms	2025-08-07 04:06:07 +00:00
Simon Fan	d4c1a08c89	Relax unclaimed successes in dtype op tests when running under TEST_WITH_DYNAMO/TEST_WITH_INDUCTOR (#159976 ) This PR changes the behavior for compile wrapped op tests: - supported_but_unclaimed_forward - supported_but_unclaimed_backward These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile. ~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976 Approved by: https://github.com/ezyang	2025-08-07 02:38:45 +00:00
Nikita Shulga	81d72fb1f7	Move smoke binary builds to 3.12 (#159993 ) And limit them just to stable CUDA version (as there weren't any recent instances when only one of those jobs failed to build) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159993 Approved by: https://github.com/ngimel ghstack dependencies: #159986, #159990	2025-08-07 01:59:30 +00:00
Nikita Shulga	d0226719a9	[BE][EZ] Delete remains of split-build logic (#159990 ) Hopefully last piece of https://github.com/pytorch/pytorch/issues/138750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159990 Approved by: https://github.com/atalman ghstack dependencies: #159986	2025-08-07 01:59:30 +00:00
Edward Yang	38d65c6465	Add a USE_NIGHTLY option to setup.py (#159965 ) If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X. Coded with claude code. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965 Approved by: https://github.com/malfet	2025-08-07 01:44:20 +00:00
Yu, Guangye	2ba2f598f3	[Dynamo] Add torch.xpu.stream to trace rules (#159844 ) # Motivation Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844 Approved by: https://github.com/jansel	2025-08-07 01:35:50 +00:00
Laith Sakka	1bb5e6c076	update expected results (#159867 ) refresh due to https://github.com/pytorch/pytorch/pull/159696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867 Approved by: https://github.com/masnesral	2025-08-07 01:18:36 +00:00
Denghui Dong	8b0be7b65a	[Profiler] Fix unexpected C return events (#159574 ) The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there. I recently found a new cause for this problem. `1df5d00145/Python/ceval.c (L5807-L5836)` In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur. Here is the code to reproduce this problem. ``` import threading import time import torch from threading import Event, Lock lock = Lock() lock.acquire() event1 = Event() event2 = Event() event3 = Event() def run(): event1.set() event2.wait() lock.acquire() event3.set() threading.Thread(target=run).start() with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True): event1.wait() event2.set() time.sleep(1) with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True): lock.release() event3.wait() ``` To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when a PyTrace_C_RETURN/PyTrace_C_EXCEPTION event occurs, we can determine whether to record this event based on these two fields. In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed). Do you think the fix is necessary? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574 Approved by: https://github.com/sraikund16	2025-08-07 01:17:55 +00:00
Xuehai Pan	5cedc5a0ff	[BE][PYFMT] migrate PYFMT for `torch/[p-z]*/` to `ruff format` (#144552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144552 Approved by: https://github.com/ezyang	2025-08-07 00:09:56 +00:00
William Wen	fd606a3a91	[dynamo] update pytorch-labs -> meta-pytorch in graph break URLs (#159975 ) Related PR: https://github.com/meta-pytorch/compile-graph-break-site/pull/30 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159975 Approved by: https://github.com/Lucaskabela	2025-08-06 23:57:31 +00:00
Animesh Jain	3daef4d128	[dynamo] Trace nn.Module __delattr__ (#159969 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159969 Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/StrongerXi	2025-08-06 23:43:19 +00:00
PyTorch MergeBot	cb4b29b754	Revert "[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874 )" This reverts commit 9fd5b5f73589cf08dca60910368cc0f05c7906c8. Reverted https://github.com/pytorch/pytorch/pull/159874 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/159874#issuecomment-3161896978))	2025-08-06 23:21:29 +00:00
drisspg	a6bc296207	[FlexAttention] Update the guard semantics for divisibility (#159884 ) We don't add guards unless we know (and another guard has ensured this) that this is a safe optimization Pull Request resolved: https://github.com/pytorch/pytorch/pull/159884 Approved by: https://github.com/Chillee	2025-08-06 23:12:44 +00:00
Thomas Bohnstingl	64dc30c213	[HOP, map] Rework of map autograd to the new interface (#153343 ) This PR reworks the current autograd implementation of map to the new interface. @pytorchbot label "topic: not user facing" Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343 Approved by: https://github.com/ydwu4	2025-08-06 23:02:42 +00:00
Nathan Brown	93da9952a7	gloo: fix building system gloo with CUDA/HIP (#146637 ) Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support. This had been updated when building/linking with vendored Gloo, but not when using system Gloo. Fixes: #146239 Reported-by: Adam J Stewart <ajstewart426@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637 Approved by: https://github.com/malfet	2025-08-06 22:56:31 +00:00
christinaburge	3a2c3c8ed3	unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786 ) These tests now pass on AArch64 in our downstream CI. `test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157786 Approved by: https://github.com/jerryzh168, https://github.com/malfet	2025-08-06 22:41:07 +00:00
Jovian Anthony Jaison	9fd5b5f735	[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874 ) Summary: Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast. Test Plan: See: D79456310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159874 Approved by: https://github.com/c00w	2025-08-06 22:33:04 +00:00
Xiaochang Wu	2507ae63f2	Partitioner: Fix to align partition node order with original graph (#157892 ) Fixes #157891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892 Approved by: https://github.com/ezyang	2025-08-06 22:12:47 +00:00
Lucas Kabela	40c4d61f9a	[Dynamo][Better Engineering] Typing `torch/_dynamo/guards.py` (#159315 ) As part of the better engineering effort, we would like to improve our type support to improve dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/guards.py` Running ``` mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2030 \| 3945 \| 51.46% \| 70 \| 138 \| 50.72% \| \| This PR \| 4055 \| 4055 \| 100.00% \| 138 \| 138 \| 100.00% \| \| Delta \| +2025 \| +90 \| +48.54% \| +68 \| 0 \| +49.28% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159315 Approved by: https://github.com/williamwen42, https://github.com/Skylion007	2025-08-06 21:52:14 +00:00
Tom Ritchford	a5725965ea	Remove unnecessary "# noqa: set_linter" comments (#159467 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467 Approved by: https://github.com/eellison	2025-08-06 21:31:52 +00:00
Ruben Rodriguez Buchillon	289f62ce8a	[inductor][ez] fixup scaled_mm (#159948 ) Summary: This reverts the part of #159383 for scaled_mm where now, like before, we pass through the normal input_nodes (not the triton_input_nodes) to select_algorithm - #159383 refactored how kwargs are retrieved - it introduced this notion of KernelInputs that wrap input_nodes - scaled_mm uses unsqueezed input nodes for triton to retrieve params - the issue: it uses a squeezed (regular) bias for select_algorithm instead This fixes that by passing the original input nodes rather than the triton input nodes. Test Plan: ``` buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)' buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)' ``` This set of tests was failing, and is passing now Side note: these tests were failing I believe because the unsqueezed bias made the ATEN choice no longer eligible, and there is some minor numerical discrepancy between ATEN and Triton for this. I'm not sure the test should be written like that, as we're implicitly relying on ATEN being the choice here. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159948 Approved by: https://github.com/izaitsevfb, https://github.com/eellison	2025-08-06 21:25:48 +00:00
Nikita Shulga	512b4730e3	[EZ] Remove useless `cross_compile_arm64` (#159986 ) As we haven't had any Intel Mac runners in CI for the last 2+ years Pull Request resolved: https://github.com/pytorch/pytorch/pull/159986 Approved by: https://github.com/atalman	2025-08-06 21:01:05 +00:00
Xia, Weiwen	d2368aa6f3	[CPUBLAS] add macros for brgemm APIs for versioning (#158629 ) Summary Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. It is useful when callers need to co-work with old versions of PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158629 Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang	2025-08-06 20:54:05 +00:00
Mwiza Kunda	0afaeb7c4e	Improve `extract_test_fn` (#158637 ) The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637 Approved by: https://github.com/jbschlosser	2025-08-06 20:45:21 +00:00
Alan Du	50580b5053	Add minimal nn.functional.log_softmax support for NestedTensor (#159662 ) This only works for the jagged layout and for the non-batch and non-jagged dimensions. I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662 Approved by: https://github.com/jbschlosser	2025-08-06 20:34:02 +00:00
Frank Seide	b8ef60b6bc	Enable XNNPACK aarch64 builds (#159762 ) Summary: This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device. Thanks to andrewjcg for proposing this fix. Rollback Plan: Reviewed By: andrewjcg Differential Revision: D79497613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159762 Approved by: https://github.com/frankseide, https://github.com/malfet Co-authored-by: Frank Seide <seide@meta.com>	2025-08-06 20:20:32 +00:00
Nikita Shulga	0de2a45a48	[BE] Merge 3 CUDA build jobs into one (#159890 ) Before this change there were build+test jobs: - sm89 build+tests - sm75 build+distributed_test - sm_75 build+pr_time_benchmark test This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159890 Approved by: https://github.com/clee2000, https://github.com/seemethere	2025-08-06 20:09:55 +00:00
xinan.lin	12a54e4ac1	[Inductor UT][Fix XPU CI] Fix case failures introduced by community. (#159759 ) Fixes #159631 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159759 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-06 20:02:20 +00:00
Nikita Shulga	d10e9e4781	[MPS] Remove all pre-MacOS14 logic (#159912 ) Delete older enums, checks for MacOS-13.3+ for int64 support, etc Fixes https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159912 Approved by: https://github.com/manuelcandales	2025-08-06 19:48:12 +00:00
Xu Han	c71950907d	[inductor] add _get_inductor_debug_symbol_cflags for debug symbol control. (#159938 ) We need to add inductor debug symbol support for debugging crash cases. When debug symbol generation is turned on, it should create a [module_name].pdb file on Windows, which helps debugging with WinDbg. On Linux, it should create debug sections in the binary file. I also added a UT for it. It works well for inductor debugging on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159938 Approved by: https://github.com/jansel, https://github.com/angelayi	2025-08-06 19:31:45 +00:00
Divyansh Khanna	6fa3592dc6	Dataloader benchmark script (#159432 ) This script adds a simple dataloading benchmark tracking throughput and memory. The output looks like this ``` System Information: PyTorch version: 2.9.0a0+gitf87d117 PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py Torchvision version: 0.24.0a0+f52c4f1 Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py CUDA available: True CUDA device: NVIDIA PG509-210 CPU count: 192 Physical CPU cores: 96 Total system memory: 1510.11 GB Loading dataset from imagenet/val (1 copies) Dataset size: 50000 --- Benchmarking DataLoader with worker_method=multiprocessing --- Memory before DataLoader creation: 500.59 MB Detailed memory information: USS (Unique Set Size): 499.00 MB PSS (Proportional Set Size): 500.74 MB RSS (Resident Set Size): 497.39 MB Memory after DataLoader creation: 1127.61 MB Memory increase: 627.02 MB Starting training loop with 1 epochs (max 100 batches per epoch) Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB Results: Worker method: multiprocessing DataLoader init time: 0.1553 seconds Average batch time: 0.3408 seconds Samples per second: 375.53 Peak memory usage: 12738.76 MB Memory increase: 12238.17 MB ``` > TODO: This script right now is CPU-only friendly and GPU friendly. But it might be worth upgrading it to test against a canonical DistributedDataParallel setup on say a 1x8 node. 
Or maybe we can keep that as a separate script inside `benchmarks` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159432 Approved by: https://github.com/ramanishsingh	2025-08-06 19:05:19 +00:00
PyTorch MergeBot	ba37f589d4	Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 )" This reverts commit ee62177c196d716fc3a2d641370bed8a673a45d3. Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/159696#issuecomment-3161196192))	2025-08-06 18:41:05 +00:00
Bin Bao	44dd3684d2	[AOTI] Fix memory leak from all_reduce (#159818 ) Summary: This PR solves two issues: 1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI because it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the later corresponding wait operation still waits on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of the output from aoti_torch_cpu__c10d_functional_all_reduce. This leaves an unwaited tensor, leading to a memory leak. 2. Since AOTI now generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, the return tensor from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It will be released when the program exits, so it's not a memory leak, but it unnecessarily holds that tensor, which causes a high memory watermark. This PR generates a tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818 Approved by: https://github.com/henryhu6, https://github.com/yushangdi	2025-08-06 18:11:14 +00:00
Georgia Phillips	c669b0ab87	Fix execution frame cleanup logic (#158717 ) Summary: This fixes a bug in the execution frame cleanup logic - previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic. Test Plan: ``` buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details ``` Rollback Plan: Differential Revision: D78621408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158717 Approved by: https://github.com/dolpm	2025-08-06 18:04:24 +00:00
Luca Wehrstedt	d7a855d67d	[async-TP] Make scaled-mm + reduce-scatter preserve alignment of scales (#159957 ) After https://github.com/pytorch/pytorch/pull/157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159957 Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre	2025-08-06 17:42:26 +00:00
Meet Vadakkanchery	4c01991b38	[DCP][Prototype] Checkpoint replication via PGTransport (#157963 ) (#159801 ) Summary: ### PR Context Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners". Test Plan: CI Rollback Plan: Differential Revision: D79590797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801 Approved by: https://github.com/saumishr	2025-08-06 16:52:03 +00:00
Bin Bao	a4b07fe8f6	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-08-06 15:59:27 +00:00
Mikayla Gawarecki	d87161c3c8	[Easy] Fix wrong propagation of fallback_ops_dict in gen_aoti_c_shim (#159904 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159904 Approved by: https://github.com/janeyx99	2025-08-06 15:09:18 +00:00
Zhengxu Chen	79eca4677b	[precompile] Skip serializing unnecessary objects for guards. (#158926 ) Summary: The following types of objects don't need to be serialized for precompile: 1. PyCapsule because we don't guard on C binding objects in meaningful ways. 2. Code object because we only do id matching on these but id matches will always be dropped for precompile. 3. Nested function objects since we also ban CLOSURE_MATCH. Test Plan: buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects Rollback Plan: Differential Revision: D78816888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926 Approved by: https://github.com/jamesjwu	2025-08-06 15:00:28 +00:00
PyTorch MergeBot	2855688a1d	Revert "Replace C array with std::array in formatSockAddr (#159812 )" This reverts commit e7feedf6a9bb346ad205796aa4084c8dcfb18072. Reverted https://github.com/pytorch/pytorch/pull/159812 on behalf of https://github.com/malfet due to Looks like it broke distributed tests, see `2231c3ca3a/1` ([comment](https://github.com/pytorch/pytorch/pull/159812#issuecomment-3160513656))	2025-08-06 14:55:48 +00:00
Nikita Shulga	2231c3ca3a	[CI][CD] Fix `install_nvshem` function (#159907 ) When one builds CD docker, all CUDA dependencies must be installed into the `/usr/local/cuda/` folder Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907): ``` 2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to: '' 2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at: '' 2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB: '/usr/local/cuda/lib64/libnvshmem_host.so' 2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB: '/usr/local/cuda/lib64/libnvshmem_device.a' 2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR: '/usr/local/cuda/include' 2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2025-08-06 14:44:37 +00:00
can-gaa-hou	c03a734ba1	[OpenReg] Disable automatic inclusion of data files (#159845 ) # Background After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which was not used in the runtime. # Motivation This PR aims to remove the stub.c file and any unused file when running torch_openreg. Changes: - Setting include_package_data keyword to false in the setup function Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845 Approved by: https://github.com/albanD	2025-08-06 10:35:13 +00:00
Benji Beck	98316e5896	[WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325 ) Summary This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849 Motivation A fused GPU kernel for aten._weight_int8pack_mm would: - Eliminate reliance on the .mul().sum() fallback in quantization.py - Improve performance for quantized inference on CUDA - Extend Inductor’s GPU quantization support across more workloads Implementation - Implement a Triton kernel for: ``` out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n] where: x: [B, K] float32 w: [N, K] int8 scale: [N] float32 out: [B, N] float32 ``` - Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py - Route it conditionally in quantization.py where GPU currently falls back to .mul().sum() - Add unit tests comparing results to the reference fallback path Test Plan: ``` buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda ``` Log: P1882799769 ``` buck2 test 'fbcode//mode/opt' caffe2/test:linalg ``` https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/ Benchmark Results: ``` [Shape B=256, K=1024, N=512] CPU and CUDA outputs match Max abs diff: 2.59e-04, max rel diff: 0.75 CPU: 144.14 ms, CUDA: 303.67 µs Speedup: ×474.6 [Shape B=512, K=2048, N=1024] CPU and CUDA outputs match Max abs diff: 5.49e-04, max rel diff: 0.15 CPU: 1173.27 ms, CUDA: 2.40 ms Speedup: ×488.5 ``` Rollback Plan: Differential Revision: D79042656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325 Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168	2025-08-06 10:28:08 +00:00
angelayi	23cf241039	[aoti][mps] Initialize mps kernels first (#159753 ) In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance: ``` if ... auto mps_lib_0_func = ... mps_lib_0_func->run() // since we already used mps_lib_0 once, we don't re-initialize it mps_lib_0_func->run() // error, mps_lib_0_func not initialized ``` So the solution we took here is to initialize all the kernels at the beginning: ``` const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() { static const auto func = mps_lib_0.getKernelFunction("generated_kernel"); return func; } AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() { static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get()); return handle; } ... if ... get_mps_lib_0()->run() get_mps_lib_0()->run() // success ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753 Approved by: https://github.com/malfet ghstack dependencies: #159456, #159695	2025-08-06 07:54:29 +00:00
Will Constable	e7feedf6a9	Replace C array with std::array in formatSockAddr (#159812 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159812 Approved by: https://github.com/Skylion007	2025-08-06 07:44:29 +00:00
Will Constable	dad2a05bec	[DTensor] Set up DTensorContinuousTestBase (#159885 ) Also migrate `test_common_rules.py` since it was a short file `python test/distributed/tensor/test_common_rules.py` Before: Ran 10 tests in 91.516s After: Ran 10 tests in 5.604s Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885 Approved by: https://github.com/ezyang	2025-08-06 07:40:31 +00:00
Colin L Reliability Rice	0495cab545	Wire in pt2_triton_builds (#159897 ) Summary: This allows us to start seeing the failure rate on these models (and potentially alert on it). Test Plan: ``` FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 \| tee out ``` P1889607054 Waiting for the scuba table to generate, but manual logging shows it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon. Rollback Plan: Reviewed By: masnesral Differential Revision: D79308333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897 Approved by: https://github.com/masnesral	2025-08-06 07:39:51 +00:00
Mengtian Xu	abfe403981	[AIDIR] Internal util function to insert MLHub debugging insight for dynamic shape (#159391 ) Summary: This feature is Meta internal only. Add a util function to put dynamic-shape-related suggestions into MLHubDebugInsightService, which will then be surfaced to users in MLHub. The rollout will be controlled by JK. Test Plan: MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21 {F1980593060} * If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility' * The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the ds. The page is currently still in development so here we use the general PT2 compile JI page. * Test fails because of the export checks. I'll export after addressing all the comments from reviewers. Rollback Plan: Reviewed By: pianpwk Differential Revision: D78526522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391 Approved by: https://github.com/jingsh	2025-08-06 07:39:39 +00:00
Jane Xu	1690c0c3a0	[Reland] Migrate ScalarType to headeronly (#159911 ) The non ghstack version of #159416, to make sure we don't get reverted again Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911 Approved by: https://github.com/mikaylagawarecki	2025-08-06 07:36:37 +00:00
Aidyn-A	e9d27aa8fd	[CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854 ) The CUB library is part of CCCL in CUDA Toolkit 13. If CUDA is found, CUB is found as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854 Approved by: https://github.com/eqy	2025-08-06 06:03:58 +00:00
PyTorch MergeBot	2457e62c90	Revert "Set PYTHONHOME for inductor subprocesses using torch (#159382 )" This reverts commit fe8984a9f43bde10d1956abe7cb40710ed7ceed2. Reverted https://github.com/pytorch/pytorch/pull/159382 on behalf of https://github.com/malfet due to Broke MacOS testing see `d0fccbc99c/1` ([comment](https://github.com/pytorch/pytorch/pull/159382#issuecomment-3157455367))	2025-08-06 05:30:20 +00:00
Nikita Shulga	d0fccbc99c	[CI] Delete sm86 tests from pull (#159903 ) And delete sm89+cuda12.4 builds from periodic (as sm86+legacy driver should be enough) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159903 Approved by: https://github.com/huydhn	2025-08-06 05:16:55 +00:00
PyTorch UpdateBot	3461988a4b	[audio hash update] update the pinned audio hash (#159823 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159823 Approved by: https://github.com/pytorchbot	2025-08-06 05:02:35 +00:00
Will Constable	9764981116	Pass fw/bw compilers to aot_export_joint_with_descriptors (#159814 ) Allow overriding nop compilers with real ones when using this flow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159814 Approved by: https://github.com/fmassa	2025-08-06 04:50:56 +00:00
Michael Lazos	704594eb23	[Dynamo] make HOPs hashable (#159910 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159910 Approved by: https://github.com/yf225	2025-08-06 04:02:17 +00:00
eqy	bfc27cf468	[Distributed] Fix `@parametrize` on unordered iterable in distributed test (#159793 ) seems to fix https://github.com/pytorch/pytorch/issues/145807 sets aren't ordered so `@parametrize` can cause two processes to spawn with different settings originally debugged thanks to @k-artem, see https://github.com/pytorch/pytorch/issues/145807#issuecomment-2971009451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159793 Approved by: https://github.com/Skylion007, https://github.com/wconstab	2025-08-06 03:51:42 +00:00
bobrenjc93	311f74089a	remove print (#159917 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159917 Approved by: https://github.com/laithsakka	2025-08-06 03:48:23 +00:00
Tianhao Huang	14c7358c64	Enable fr_trace to read local traces from multiple hosts. (#159490 ) Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case. Test Plan: Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run ``` buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps ``` Before this diff, fr_trace cannot locate any trace files, giving the following assertion error: ``` AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_ ``` After this diff, fr_trace is able to locate the trace files, resulting in the exceptions like ``` dump = pickle.load(infile) ^^^^^^^^^^^^^^^^^^^ EOFError: Ran out of input ``` (since the trace files are fake and empty). Rollback Plan: Differential Revision: D79224727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490 Approved by: https://github.com/fduwjj	2025-08-06 03:15:34 +00:00
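A minimal sketch of the file matching described in the entry above, assuming only the naming scheme quoted there (`<hostname>.pci3_rank_<rank>`). `find_trace_files` is a hypothetical helper, not the actual `fr_trace` code; it accepts names with or without a hostname part.

```python
import re


def find_trace_files(names, prefix="pci3_rank_"):
    """Pick out dumps named "<hostname>.<prefix><rank>" or "<prefix><rank>"."""
    pattern = re.compile(
        r"^(?:(?P<host>.+)\.)?" + re.escape(prefix) + r"(?P<rank>\d+)$"
    )
    matches = []
    for name in names:
        m = pattern.match(name)
        if m:
            # host is None when the file name has no hostname part.
            matches.append((m.group("host"), int(m.group("rank")), name))
    return matches


names = ["host0.pci3_rank_0", "host0.pci3_rank_1", "host1.pci3_rank_0", "notes.txt"]
traces = find_trace_files(names)
for host, rank, name in traces:
    print(host, rank, name)
```

Matching on the trailing `<prefix><rank>` rather than on the start of the file name is what lets dumps from different hosts land in the same result list.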
Dave Lei	8ce81bcee1	[Torch Package] Make get names of OrderedImporters support fallback to importers (#155743 ) Summary: OrderedImporters is supposed to be an importer that tries every importer in self._importers. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class. This change updates the OrderedImporters get_name API so that it tries the get_name API of each importer in turn. Differential Revision: D76463252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743 Approved by: https://github.com/jcwchen, https://github.com/jingsh	2025-08-06 02:26:10 +00:00
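The fallback behavior described in the entry above can be sketched as follows. All class names here (`NameNotFound`, `TableImporter`, `FallbackImporter`) are hypothetical stand-ins, not the `torch.package` classes; the sketch only shows the "try each importer in order" pattern.

```python
class NameNotFound(Exception):
    """Raised when an importer cannot name an object (hypothetical)."""


class TableImporter:
    """Toy importer that knows (module, name) pairs for a fixed set of objects."""

    def __init__(self, table):
        self._table = dict(table)

    def get_name(self, obj):
        try:
            return self._table[obj]
        except KeyError:
            raise NameNotFound(repr(obj)) from None


class FallbackImporter:
    """Ask each wrapped importer in order and return the first answer."""

    def __init__(self, *importers):
        self._importers = list(importers)

    def get_name(self, obj):
        for importer in self._importers:
            try:
                return importer.get_name(obj)
            except NameNotFound:
                continue  # fall through to the next importer
        raise NameNotFound(repr(obj))


first = TableImporter({len: ("builtins", "len")})
second = TableImporter({sorted: ("builtins", "sorted")})
ordered = FallbackImporter(first, second)
print(ordered.get_name(sorted))
```

The lookup only fails once every importer has been tried, which mirrors the intent stated in the summary.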
Yu, Guangye	4604f0482c	Add UT for torch.accelerator memory-related API (#155200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200 Approved by: https://github.com/albanD ghstack dependencies: #138222, #152932	2025-08-06 02:22:18 +00:00
Yu, Guangye	15f1173e5d	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-06 02:22:18 +00:00
henrylhtsang	e16c48ae97	[BE] Fix type hint in AOTIRunnerUtil (#159577 ) Not sure why it was labelled as list in the first place. In test_aot_inductor.py, I scanned a few use cases and they are tuple as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159577 Approved by: https://github.com/Skylion007	2025-08-06 01:20:45 +00:00
Yu, Guangye	f7a66da5f9	Add DeviceAllocator as the base device allocator (#138222 ) # Motivation In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these use cases. <div align="center"> <table> <tr> <td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td> </tr> <tr> <td> ```python torch.xxx.empty_cache ``` </td> <td> ```python torch.accelerator.empty_cache ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_peak_memory_stats ``` </td> <td> ```python torch.accelerator.reset_peak_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.reset_accumulated_memory_stats ``` </td> <td> ```python torch.accelerator.reset_accumulated_memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_stats ``` </td> <td> ```python torch.accelerator.memory_stats ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_allocated ``` </td> <td> ```python torch.accelerator.memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_allocated ``` </td> <td> ```python torch.accelerator.max_memory_allocated ``` </td> </tr> <tr> <td> ```python torch.xxx.memory_reserved ``` </td> <td> ```python torch.accelerator.memory_reserved ``` </td> </tr> <tr> <td> ```python torch.xxx.max_memory_reserved ``` </td> <td> ```python torch.accelerator.max_memory_reserved ``` </td> </tr> </table> </div> # Solution This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. 
This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222 Approved by: https://github.com/albanD, https://github.com/Camyll	2025-08-06 00:40:29 +00:00
Animesh Jain	3eb3da9b4b	[dynamo][guards] Skip ID_MATCH guard on self.__class__.__closure__ (#159888 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159888 Approved by: https://github.com/williamwen42	2025-08-06 00:36:43 +00:00
Jane Xu	3ddfd46bd2	Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604 Approved by: https://github.com/albanD, https://github.com/desertfire	2025-08-06 00:29:56 +00:00
Zhengxu Chen	6a82da392e	[export] Fix generated schema for C++20/23 (#159871 ) Summary: Fixing the issue from https://github.com/pytorch/pytorch/issues/159838 Test Plan: buck run caffe2/:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/ Rollback Plan: Differential Revision: D79647167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159871 Approved by: https://github.com/malfet	2025-08-06 00:23:05 +00:00
Simon Fan	22bedc429f	Extract some HOP utils to be importable (#159705 ) Useful helper function for stage 1 export -> manual partitioner -> stage 2 compile users Pull Request resolved: https://github.com/pytorch/pytorch/pull/159705 Approved by: https://github.com/zou3519 ghstack dependencies: #159134	2025-08-05 23:59:47 +00:00
Huy Do	49abc0e3f8	[Take 2] Setup TorchBench in Docker (#159300 ) Fix and reland https://github.com/pytorch/pytorch/pull/158613, I keep `checkout_install_torchbench` in `.ci/pytorch/macos-test.sh` script because it's still used there, and there is no Docker. ### Testing MacOS perf nightly run https://github.com/pytorch/pytorch/actions/runs/16580798470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159300 Approved by: https://github.com/ZainRizvi	2025-08-05 23:47:42 +00:00
Xu Han	1052604acd	fix logging setup issue for Windows.. (#159887 ) When we set up the logging config following the guide: https://docs.pytorch.org/docs/stable/logging.html For example: TORCH_LOGS="+schedule,+inductor,+output_code" On Linux, it shows as: ```cmd declare -x SSH_TTY="/dev/pts/0" declare -x TERM="xterm" declare -x TORCH_LOGS="+schedule,+inductor,+output_code" declare -x USER="xu" ``` On Windows, it shows as: ```cmd TORCHINDUCTOR_WINDOWS_TESTS=1 TORCH_LOGS="+schedule,+inductor,+output_code" UCRTVersion=10.0.22000.0 ``` Linux displays quotes by default, whereas Windows does not add quotes for display. In addition, Windows keeps the quotes as part of the value when processing the env var. On Linux, we get the variable: "+schedule,+inductor,+output_code" On Windows, we get the variable: '"+schedule,+inductor,+output_code"' So we need to remove the outer quotes on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887 Approved by: https://github.com/angelayi	2025-08-05 23:44:38 +00:00
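The quote handling described in the entry above can be sketched like this. `strip_outer_quotes` is a hypothetical helper written for illustration, not the code from the PR; it removes one pair of matching outer quotes and leaves everything else untouched.

```python
def strip_outer_quotes(value):
    """Drop a single pair of matching outer quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value


# Value as it would arrive on Windows according to the commit message.
windows_value = '"+schedule,+inductor,+output_code"'
# Value as it would arrive on Linux.
linux_value = "+schedule,+inductor,+output_code"

print(strip_outer_quotes(windows_value))
print(strip_outer_quotes(linux_value))
```

Because an unquoted value passes through unchanged, the same helper could be applied on every platform rather than only on Windows; whether the PR does so is not stated in the message.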
Alex Malyshev	fe8984a9f4	Set PYTHONHOME for inductor subprocesses using torch (#159382 ) Summary: This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`. There are more `sys.executable` subprocesses in torch/ but it seems like they're fine. Test Plan: Local inference runs. Reviewed By: aorenste Differential Revision: D79124705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382 Approved by: https://github.com/aorenste	2025-08-05 23:32:48 +00:00
angelayi	74a754aae9	Add meta kernel for sdpa_math_for_mps (#159695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695 Approved by: https://github.com/malfet ghstack dependencies: #159456	2025-08-05 22:27:06 +00:00
angelayi	b1ec088113	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-05 22:27:06 +00:00
angelayi	fb35a9ea4a	[export] Improve error messages (#159881 ) Originally, if the PT2 errored when loading, we would try to load it using the old loader to handle BC issues. However, this hides the error message when an up-to-date PT2 fails to load for some other reason. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881 Approved by: https://github.com/yushangdi	2025-08-05 22:26:48 +00:00
Sandeep Narendranath Karjala	8034b2a732	[inductor] Add TLParse artifact for logging runtime of collective and compute ops (#159730 ) Summary: - debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format - test_structured_trace.py: Added comprehensive test coverage with testing compute and collective ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730 Approved by: https://github.com/yushangdi ghstack dependencies: #159190	2025-08-05 22:06:32 +00:00
anwang	64cc6f06b1	[Inductor] Revert minimal changes to avoid internal test failures (#159809 ) The diff/PR https://github.com/pytorch/pytorch/pull/159211 caused a bunch of test failures for graph compiler(T232684410). But I couldn't figure out a forward fix so far. So with this diff/PR, I'm proposing to revert the minimal changes to resolve the test failures. I'll continue the debugging, and re-land the reverted changes once we find out a forward fix. Differential Revision: [D79221721](https://our.internmc.facebook.com/intern/diff/D79221721/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159809 Approved by: https://github.com/blaine-rister, https://github.com/eellison	2025-08-05 22:05:26 +00:00
PyTorch MergeBot	410812763b	Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 )" This reverts commit bbc0df1094b5a4dcd2cce83f8402127b07913231. Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))	2025-08-05 22:00:24 +00:00
Michael Lazos	bdb07a2bc5	[Cutlass] Allow offsets to be passed as arguments to kernel (#159761 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159761 Approved by: https://github.com/henrylhtsang ghstack dependencies: #159760	2025-08-05 21:59:07 +00:00
Simon Fan	8085edc8f9	[autograd] torch._C._set_view_replay_enabled state leaking into other tests (#159840 ) This was causing view_fns to pop up in tests that ran after `TestAutograd.test_view_replay_enabled` where it isn't used as a context manager. It is unclear to me why we would want `_force_original_view_tracking` to mutate global state on __init__ rather than on __enter__, that could be an alternative fix. FIXES https://github.com/pytorch/pytorch/issues/156306 https://github.com/pytorch/pytorch/issues/156289 https://github.com/pytorch/pytorch/issues/156265 https://github.com/pytorch/pytorch/issues/156209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159840 Approved by: https://github.com/albanD	2025-08-05 21:57:49 +00:00
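The state leak described in the entry above comes from a context manager that changes global state in `__init__` instead of `__enter__`. The sketch below uses a plain dictionary as the "global" flag; `SetOnInit` and `SetOnEnter` are hypothetical classes, not PyTorch's `_force_original_view_tracking`.

```python
STATE = {"view_replay": False}  # stand-in for a process-wide flag


class SetOnInit:
    """Mutates the flag as soon as the object is constructed."""

    def __init__(self, value):
        self.prev = STATE["view_replay"]
        STATE["view_replay"] = value  # side effect even without `with`

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        STATE["view_replay"] = self.prev
        return False


class SetOnEnter:
    """Mutates the flag only while the `with` block is active."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        self.prev = STATE["view_replay"]
        STATE["view_replay"] = self.value
        return self

    def __exit__(self, *exc):
        STATE["view_replay"] = self.prev
        return False


SetOnEnter(True)                      # constructed, never entered
after_enter_style = STATE["view_replay"]

SetOnInit(True)                       # constructed, never entered
after_init_style = STATE["view_replay"]
print(after_enter_style, after_init_style)
```

Constructing `SetOnInit` without a `with` block leaves the flag set for whatever runs next, which is the kind of leak between tests that the commit message describes.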
Nikita Shulga	882d50c5bf	[C10] Add `Scalar::isUnsigned()` method (#159877 ) It returns true if the Scalar holds an unsigned integral value, with the implications of `Tag::HAS_u` semantics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159877 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2025-08-05 21:43:21 +00:00
Catherine Lee	b52a4d0821	[ez][CI] Remove some unused docker images (#159171 ) Removes unused docker images from the docker build workflow, then removes unused definitions in build.sh. The only one I left is the vllm one, because I'm pretty sure it's going to be used in the future. I assume everything not mentioned is old and we forgot to remove it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171 Approved by: https://github.com/yangw-dev	2025-08-05 21:31:53 +00:00
Nikita Shulga	a45a840926	[CI] Disable check-labels and check_mergeability (#159900 ) See https://github.com/pytorch/pytorch/issues/159825 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159900 Approved by: https://github.com/clee2000	2025-08-05 21:16:12 +00:00
Nikita Shulga	9b953bb3fb	[BE] Update TensorPipe pin (#159834 ) No functional changes, just: - Update C++ standard to C++17 - Update `cmake` min version to 3.18 - Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10) - Replace boost optional implementation with `std::optional` wrapper - Make it compilable with gcc-14.x plus by including `cstddef` in few headers - Avoid using deprecated enums for MacOS builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834 Approved by: https://github.com/Skylion007	2025-08-05 20:45:09 +00:00
eellison	eb25a95a6e	Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569 ) With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline: ``` When an operation mutates a buffer in-place, the scheduler creates a new buffer name to track the "before" and "after" states, even though they share the same memory. The mutated buffer represents a rename with zero allocation and deallocation cost. During dependency tracking, we transfer dependencies from the mutated name back to the original buffer, ensuring the original memory is only freed when all aliases are done. This handles cases where a buffer has multiple non-overlapping aliases - rather than trying to assign free costs to individual aliases, we forward all alias dependencies to the original buffer. Consider: buf0 = op0() buf1 = mutation_op_(buf0) del buf0 ... op(buf1) del buf1 The only memory events are the creation prior to op0, and the deletion following buf1. ``` As @IvanKobzarev 's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect. This PR also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569 Approved by: https://github.com/IvanKobzarev	2025-08-05 19:58:11 +00:00
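The rule quoted in the entry above (free the original buffer only when all of its aliases are done) can be sketched in a few lines. This is an illustration under simplifying assumptions, not the Inductor scheduler: `free_steps`, `last_use`, and `alias_of` are hypothetical names, and program steps are plain integers.

```python
def free_steps(last_use, alias_of):
    """Map each original buffer to the step after which its memory can be freed.

    last_use: buffer name -> index of the last step that uses it.
    alias_of: mutated buffer name -> the buffer it was created from
              (a rename with zero allocation cost).
    """

    def root(name):
        # Follow rename links back to the buffer that owns the memory.
        while name in alias_of:
            name = alias_of[name]
        return name

    freed = {}
    for name, step in last_use.items():
        owner = root(name)
        # Forward every alias dependency to the owner: the owner dies last.
        freed[owner] = max(freed.get(owner, -1), step)
    return freed


# Example from the inline comment: buf1 is an in-place mutation of buf0.
# step 0: buf0 = op0(); step 1: buf1 = mutation_op_(buf0); step 3: op(buf1)
result = free_steps(
    last_use={"buf0": 1, "buf1": 3},
    alias_of={"buf1": "buf0"},
)
print(result)
```

Only `buf0` appears in the result, matching the statement that the only memory events are one creation and one deletion.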
eqy	9884d0351e	[CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522 ) Otherwise we see `CUDA error: too many resources requested for launch` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522 Approved by: https://github.com/janeyx99	2025-08-05 19:26:25 +00:00
Eli Uriegas	d7c83972d5	tools: Add mode to find python automatically (#159820 ) Add support for automatically finding Python interpreters in manylinux environments to our wheel building script. Scaffolding for sequential builds Signed-off-by: Eli Uriegas <eliuriegas@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159820 Approved by: https://github.com/malfet	2025-08-05 19:19:22 +00:00
Nikita Shulga	e06b110f73	[Testing] Add MPS to NATIVE_DEVICES (#153835 ) This would allow me to enable more opinfo tests against MPS device eventually and was supposed to be a very simple change, but actually required minor adjustments to lots of test files, namely: - Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64` - Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())` - Skip `test_from_dlpack_noncontinguous` as it currently crashes (needs to be fixed) - Add lots of `expectedFailureIfMPS` - Delete all `@onlyNativeDeviceTypesAnd("mps")` <sarcasm> I love how well documented this variable is </sarcasm> Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835 Approved by: https://github.com/Skylion007	2025-08-05 18:57:35 +00:00
Zheng, Zhaoqiong	0ba09a6d34	fix link for tutorial of inductor on windows (#159853 ) fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853 Approved by: https://github.com/sekyondaMeta Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com> Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>	2025-08-05 18:37:47 +00:00
Luca Wehrstedt	aeb5321b63	Allow controlling PG backend and options via init_device_mesh (#159371 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371 Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol	2025-08-05 12:44:14 +00:00
Ruben Rodriguez Buchillon	625108ede2	[inductor] consolidate common GEMM triton param retrieval (#159383 ) # Why - Make loop iteration simpler - Have a common spot where to make modifications that affect all the GEMM Triton templates, avoiding missed spots # What - pull out common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383 Approved by: https://github.com/jansel	2025-08-05 11:42:25 +00:00
Edward Z. Yang	09e5a93fcb	Improve graph output alias with subclass error message (#159619 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159619 Approved by: https://github.com/albanD	2025-08-05 06:47:31 +00:00
Yu, Guangye	908c5cc4c0	Generalize torch._C._set_allocator_settings to be generic (#156175 ) # Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312, #156165	2025-08-05 04:08:42 +00:00
Yu, Guangye	c1145852a5	Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312	2025-08-05 04:08:42 +00:00
Yu, Guangye	ae1a706444	Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312 ) # Motivation Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312 Approved by: https://github.com/albanD ghstack dependencies: #159629	2025-08-05 04:08:04 +00:00
Yu, Guangye	56d19a5ced	Fix AllocatorConfig potential SIO issue (#159629 ) # Motivation As @ScottTodd identified in this [comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3141524874), using STL containers like `std::string` and `std::unordered_set` at static init time can cause static initialization order issues. This PR is based on and modified from his original PR: https://github.com/pytorch/pytorch/pull/159607. I’m stacking this PR here to help facilitate the landing and validation process. Co-authored-by: @ScottTodd Pull Request resolved: https://github.com/pytorch/pytorch/pull/159629 Approved by: https://github.com/ScottTodd, https://github.com/albanD	2025-08-05 04:07:51 +00:00
Lucas Kabela	b6c53383fe	[Dynamo][Better Engineering] Type annotation for `torch/_dynamo/output_graph.py` (#159602 ) As part of the better engineering effort, we would like to improve our type support to improve dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/output_graph.py` Running ``` mypy torch/_dynamo/output_graph.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2163 \| 4792 \| 45.14% \| 121 \| 268 \| 45.15% \| \| This PR \| 4818 \| 4818 \| 100.00% \| 268 \| 268 \| 100.00% \| \| Delta \| +2655 \| +26 \| +54.84% \| +147 \| 0 \| +54.85% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159602 Approved by: https://github.com/Skylion007	2025-08-05 03:50:54 +00:00
Divyansh Khanna	4fd5fabee9	skip XPU for dataloader CPU only unit test (#159811 ) Fixes [#159802](https://github.com/pytorch/pytorch/issues/159802) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159811 Approved by: https://github.com/izaitsevfb	2025-08-05 03:44:01 +00:00
Nick Riasanovsky	bbc0df1094	[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777 ) Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs. Test Plan: Relying on CI. Should be a NFC. Rollback Plan: Reviewed By: davidberard98 Differential Revision: D79378792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777 Approved by: https://github.com/davidberard98	2025-08-05 03:29:13 +00:00
Mark Harfouche	33ec6e3e9a	Remove pin on libuv from instructions (#159504 ) This package doesn't exist at conda-forge and causes some confusion for users. see https://anaconda.org/conda-forge/libuv/files?version=1.39.0 libuv is quite stable, so the newer versions should be fine. we build with them anyway at conda-forge. see: https://github.com/conda-forge/libuv-feedstock/issues/80 Hopefully this can help future users. Fixes https://github.com/conda-forge/libuv-feedstock/issues/80 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159504 Approved by: https://github.com/seemethere	2025-08-05 03:18:42 +00:00
CaoE	efc4b460b3	Add cascade sum support for Inductor CPP backend (#156296 ) Fixes #154703 Add cascade summation support for the Inductor CPP backend to improve precision for large summations. Currently, Inductor CPP performs the sum reduction directly. As shown in #154703, when the sum is large and the degree of parallelism is low, direct reduction causes an intolerable precision loss: ``` extern "C" void kernel(float* in_out_ptr0, const float* in_ptr0) { auto out_ptr0 = in_out_ptr0; { { float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); tmp_acc0_vec = tmp_acc0_vec + tmp0; } } } tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec); out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0); } } { { { auto tmp0 = out_ptr0[static_cast<int64_t>(0L)]; auto tmp1 = static_cast<float>(3000000000.0); auto tmp2 = tmp0 / tmp1; in_out_ptr0[static_cast<int64_t>(0L)] = tmp2; } } } } ``` After adding cascade sum support: ``` extern "C" void kernel(float* in_out_ptr0, const float* in_ptr0) { auto out_ptr0 = in_out_ptr0; { { float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); at::vec::Vectorized<float> masked_tmp_acc0_vec = at::vec::Vectorized<float>(0); CascadeSumHelper<float, 65536> scalar_cascade_helper0(static_cast<int64_t>(3000000000L)); CascadeSumHelper<at::vec::Vectorized<float>, 65536> cascade_helper0(static_cast<int64_t>(187500000L)); CascadeSumHelper<at::vec::Vectorized<float>, 65536> masked_cascade_helper0(static_cast<int64_t>(0L)); for(int64_t 
x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); tmp_acc0_vec = cascade_sum_combine(tmp0, &cascade_helper0); } } } tmp_acc0 = cascade_sum_final(&scalar_cascade_helper0); tmp_acc0_vec = cascade_sum_final(&cascade_helper0); masked_tmp_acc0_vec = cascade_sum_final(&masked_cascade_helper0); tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec + masked_tmp_acc0_vec); out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0); } } { { { auto tmp0 = out_ptr0[static_cast<int64_t>(0L)]; auto tmp1 = static_cast<float>(3000000000.0); auto tmp2 = tmp0 / tmp1; in_out_ptr0[static_cast<int64_t>(0L)] = tmp2; } } } } ``` This will inevitably reduce performance when cascade sum is turned on. For the case shown in #154703: performance reduced by ~3%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-08-05 02:54:32 +00:00
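Why splitting a long sum into chunks helps precision can be shown with a small experiment. This is a sketch of a two-level chunked sum in the spirit of cascade summation, not the `CascadeSumHelper` from the PR: float32 arithmetic is emulated with `struct`, and the chunk size of 256 and the input of 100,000 copies of 0.1 are arbitrary choices for illustration.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Accumulate left to right, rounding to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc


def chunked_sum_f32(values, chunk=256):
    """Sum each chunk separately, then sum the per-chunk partial results."""
    total = 0.0
    for start in range(0, len(values), chunk):
        partial = naive_sum_f32(values[start:start + chunk])
        total = f32(total + partial)
    return total


values = [f32(0.1)] * 100_000
exact = math.fsum(values)  # reference sum computed without float32 rounding
naive_err = abs(naive_sum_f32(values) - exact)
chunked_err = abs(chunked_sum_f32(values) - exact)
print(naive_err, chunked_err)
```

The naive accumulator spends most of its time adding a small value to a large running total, so rounding errors pile up; the chunked version keeps each partial sum small and performs far fewer additions at large magnitude.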
Nikita Shulga	1ca8388442	[BE][MPS] Remove unused size12 variable (#159832 ) Fixes following compilation warning ``` /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable] auto size12 = input_sizes[1] * input_sizes[2]; ^ 1 warning generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832 Approved by: https://github.com/dcci	2025-08-05 02:32:06 +00:00
dolpm	b69497351d	[nativert] force resize to zero. (#159683 ) Summary: this was quite a miserable bug. there are a few kernels that don't explicitly resize outputs to zero, which led to some weird UB. Rollback Plan: Differential Revision: D79476454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159683 Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier	2025-08-05 02:25:31 +00:00
Will Constable	482f069c41	[C10D] fix slow init due to repeated dns resolution failure (#159596 ) It can be very slow to repeatedly hit DNS resolution failure, but it's very helpful to have DNS names in logs by default. So we try to use DNS, but if we hit a transient failure we just disable it for the remainder of the job, logging IP addresses instead. Fixes #159007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596 Approved by: https://github.com/d4l3k	2025-08-05 02:15:26 +00:00
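The "try DNS, disable it after the first failure" behavior described in the entry above can be sketched as follows. `HostnameLogger` is a hypothetical class, not the C10D implementation; the lookup function is injected so the example runs without network access.

```python
class HostnameLogger:
    """Resolve IPs to names for logging, giving up after the first failure."""

    def __init__(self, lookup):
        # lookup: callable taking an IP string and returning a hostname;
        # it is expected to raise OSError when resolution fails.
        self._lookup = lookup
        self._dns_enabled = True

    def describe(self, ip):
        if self._dns_enabled:
            try:
                return self._lookup(ip)
            except OSError:
                # Do not pay the failure cost again for the rest of the job.
                self._dns_enabled = False
        return ip


calls = []


def failing_lookup(ip):
    """Stand-in resolver that always fails, recording each attempt."""
    calls.append(ip)
    raise OSError("resolution failed")


logger = HostnameLogger(failing_lookup)
labels = [logger.describe(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
print(labels, len(calls))
```

In real use the injected function could wrap `socket.gethostbyaddr`, whose errors are subclasses of `OSError`; only the first address pays for a failed lookup.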
Benjamin Hottell	85d931f29e	Use uppercase OR when checking for system XNNPACK (#159527 ) This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`. --- For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error: ``` CMake Error at cmake/Dependencies.cmake:566 (if): if given arguments: "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY" Unknown arguments specified Call Stack (most recent call first): CMakeLists.txt:868 (include) ``` Upon making the change in this PR (changing `or` to `OR`), the process continued as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527 Approved by: https://github.com/janeyx99	2025-08-05 02:10:53 +00:00
Jack Taylor	8a2f53c523	Recursively sync fbgemm submodules before build (#159477 ) ROCm inductor benchmark builds are failing at the fbgemm build stage: https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622 ``` 2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’ 2025-07-27T08:00:32.3444080Z 389 \| x86::Xmm partial_sum_xmm(partial_sum_vreg.id()); ``` It looks like asmjit fails to build; this seems to be because the fbgemm submodules are not updated after checking out a new commit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477 Approved by: https://github.com/pruthvistony, https://github.com/eqy	2025-08-05 02:00:54 +00:00
Kurt Mohler	b59b61a099	Add `avg_pool3d` backward pass for MPS (#159089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089 Approved by: https://github.com/malfet	2025-08-05 01:55:38 +00:00
Cui, Yifeng	57ab39f7e4	Update torch-xpu-ops commit pin (#159621 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](`1f7a57f507`) includes: - Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization - Add optional NaN checks to XCCL - Fix NllLossForwardReduce2DKernelFunctor accuracy - Extend the existing communication logging to include the reduction operation for collective calls - [Reland] Install xpu codegen header to torch/include Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621 Approved by: https://github.com/EikanWang	2025-08-05 01:46:15 +00:00
Michael Lazos	182975e01a	[Dynamo] Enable torch function dispatch on HOPs (#159708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159708 Approved by: https://github.com/zou3519, https://github.com/XilunWu ghstack dependencies: #159707	2025-08-05 01:43:22 +00:00
Michael Lazos	9f8cfe7476	[Dynamo] Fix arg ordering in tf modes (#159707 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159707 Approved by: https://github.com/zou3519	2025-08-05 01:43:21 +00:00
Oguz Ulgen	e273ff028a	Fix failing test (#159800 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159800 Approved by: https://github.com/aorenste	2025-08-05 00:28:51 +00:00
David Berard	5e0fc2c9a9	[AOTI] don't allow int32 indices if {non-inf, > int32_max} upper bound is provided (#159433 ) Motivation / Context: (what I _think_ is happening here) In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation. But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA. Solution space Export allows users to specify which dimensions are dynamic, and to provide ranges of valid sizes. One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32. However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above. To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`. If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges. 
Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433 Approved by: https://github.com/jingsh, https://github.com/ColinPeppler	2025-08-05 00:17:09 +00:00
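The decision rule described in the message above can be sketched in plain Python. This is an illustration only: `can_use_int32_index`, its parameters, and `INT32_MAX` are hypothetical names, not Inductor APIs, and the real implementation works on symbolic shapes rather than plain numbers.

```python
import math

INT32_MAX = 2**31 - 1


def can_use_int32_index(hint: int, upper_bound: float, aot_mode: bool) -> bool:
    """Hypothetical helper: decide whether an index may be stored as int32.

    hint        -- example value observed at compile time
    upper_bound -- user-specified upper bound of the dynamic range (inf if unset)
    aot_mode    -- True for AOTI, where int32 guards are not re-checked later
    """
    if aot_mode and not math.isinf(upper_bound):
        # An explicit finite range was given: ignore the hint and trust the range.
        return upper_bound <= INT32_MAX
    # JIT mode (guards trigger recompilation) or no explicit upper bound:
    # keep relying on the example value, which avoids a perf regression.
    return hint <= INT32_MAX


print(can_use_int32_index(1024, math.inf, aot_mode=True))  # unbounded range
print(can_use_int32_index(1024, 2**40, aot_mode=True))     # range exceeds int32
```

With an unbounded range the small hint decides the outcome, while an explicit bound of `2**40` forces 64-bit indexing even though the example value is small.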
Shangdi Yu	bc4b04e058	DeviceCopy should have the same layout as input (#159615 ) Summary: Fix https://github.com/pytorch/pytorch/issues/159612 - Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input - The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy Test Plan: ``` buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy ``` Rollback Plan: Differential Revision: D79411407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615 Approved by: https://github.com/eellison	2025-08-04 23:56:58 +00:00
David Berard	6b414f56a4	Revert "[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160 ) (#158462 )" (#159798 ) This reverts commit 305a03727672de42870f956ddf4ad9fa424443e1. Reason: causes device-side assertion failures when running with this repro (a minimized version of a failure seen in a real model) ``` import torch def ri(inp, repeats, output_size): return torch.repeat_interleave(inp, repeats, output_size=output_size) inp = torch.arange(0, 4, device="cuda").reshape(-1, 1) x = torch.tensor([1, 2, 3, 4], device="cuda") ri_c = torch.compile(ri) print(ri(inp, x, 10)) print(ri_c(inp, x, 10)) ``` which leads to errors like ``` /tmp/torchinductor_dberard/3h/c3hlb22fpptebupstsuhl6kexa6z3upgbnyxln7c24gfcr5747iu.py:30: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp5 < 4` failed. ``` Differential Revision: [D79591561](https://our.internmc.facebook.com/intern/diff/D79591561) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159798 Approved by: https://github.com/danzimm	2025-08-04 23:39:20 +00:00
PyTorch MergeBot	fb8f32ef52	Revert "[mps] Turn on inductor dynamic shapes tests (#159456 )" This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf. Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))	2025-08-04 23:11:05 +00:00
Michael Lazos	7ba996bbaa	[Cutlass] Fix wrapper code generation breakage (#159760 ) Fixes issues introduced by https://github.com/pytorch/pytorch/pull/159355 The issue got past OSS CI because the H100 tag wasn't added, not sure how to prevent these kinds of issues in the future, perhaps we should run H100 on Inductor PRs? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159760 Approved by: https://github.com/angelayi	2025-08-04 23:03:03 +00:00
henrylhtsang	ddbdcdc710	[cutlass backend][test] Expand FP8 tests to FP16 (#159538 ) Differential Revision: [D79317343](https://our.internmc.facebook.com/intern/diff/D79317343/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159538 Approved by: https://github.com/mlazos	2025-08-04 23:01:55 +00:00
angelayi	19f1f9960d	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-04 22:44:31 +00:00
yewentao256	fd6655a0f5	Feature: Implement support for `cudnn_batch_norm_out` kernel to replace the autogen approach. (#123020 ) Fixes #115611 Autogen kernel may cause redundant copy, so we develop the kernel to improve efficiency. Test Case: ```c++ #include <torch/torch.h> #include <iostream> #include <ATen/ATen.h> #include <ATen/cuda/CUDAContext.h> int main() { auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA)); auto weight = torch::randn({3}, torch::device(torch::kCUDA)); auto bias = torch::randn({3}, torch::device(torch::kCUDA)); auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA)); auto running_var = torch::ones({3}, torch::device(torch::kCUDA)); bool training = true; double exponential_average_factor = 0.1; double epsilon = 1e-5; auto output = torch::empty_like(input); auto save_mean = torch::empty({3}, torch::device(torch::kCUDA)); auto save_var = torch::empty({3}, torch::device(torch::kCUDA)); auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve); auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon); bool is_close_output = torch::allclose(output, std::get<0>(outputs)); bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs)); bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs)); bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs)); std::cout << "Is output close: " << is_close_output << std::endl; std::cout << "Is save_mean close: " << is_close_save_mean << std::endl; std::cout << "Is save_var close: " << is_close_save_var << std::endl; std::cout << "Is reserve close: " << is_close_reserve << std::endl; return 0; } ``` Please CC @albanD Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/123020 Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD	2025-08-04 22:40:33 +00:00
Lucas Kabela	a7f3bdf550	[Dynamo][Better Engineering] Type coverage for `torch/_dynamo/utils.py` (#159580 ) As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/utils.py`. Running ``` mypy torch/_dynamo/utils.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 2163 \| 4792 \| 45.14% \| 121 \| 268 \| 45.15% \| \| This PR \| 4818 \| 4818 \| 100.00% \| 268 \| 268 \| 100.00% \| \| Delta \| +2655 \| +26 \| +54.84% \| +147 \| 0 \| +54.85% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159580 Approved by: https://github.com/williamwen42	2025-08-04 21:51:53 +00:00
Xu Han	510e8b4ae0	[inductor] use writable temp file on windows (#159738 ) Use `WritableTempFile` on Windows, reference to: https://github.com/pytorch/pytorch/pull/159342 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738 Approved by: https://github.com/angelayi, https://github.com/Skylion007	2025-08-04 21:51:02 +00:00
PyTorch MergeBot	83ba3f1101	Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 )" This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4. Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))	2025-08-04 21:47:11 +00:00
PyTorch MergeBot	1fad16aacb	Revert "[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 )" This reverts commit 444e2381d07a14cb501c00d11f9e63a3f1d2c86e. Reverted https://github.com/pytorch/pytorch/pull/158983 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))	2025-08-04 21:47:11 +00:00
Markus Hoehnerbach	444e2381d0	[inductor] move all cpu scalars using pinned memory for graph partition (#155360 ) (#158983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983 Approved by: https://github.com/eellison ghstack dependencies: #158758	2025-08-04 21:42:05 +00:00
Markus Hoehnerbach	6085bf7565	[inductor] allocate non-blocking copy destinations in pinned memory (#155121 ) (#158758 ) Fixes #155121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758 Approved by: https://github.com/EikanWang, https://github.com/eellison	2025-08-04 21:22:11 +00:00
Natalia Gimelshein	8201dbf4bc	check driver to be >=12.4 to use fabric handles (#159697 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159697 Approved by: https://github.com/malfet	2025-08-04 21:05:39 +00:00
atalman	26d045bb60	Linux py 3.14 wheel builds (#157559 ) Related to https://github.com/pytorch/pytorch/issues/156856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157559 Approved by: https://github.com/malfet, https://github.com/albanD	2025-08-04 20:55:19 +00:00
PyTorch MergeBot	356ac3103a	Revert "Stop parsing command line arguments every time common_utils is imported. (#156703 )" This reverts commit 310f901a71e53688866b14bb2f2b4c8eef9979b3. Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))	2025-08-04 20:37:39 +00:00
Kurt Mohler	d4109a0f99	[MPS] Add max_unpool1d/2d/3d (#159789 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789 Approved by: https://github.com/malfet	2025-08-04 20:00:59 +00:00
Arsh Zahed	7ea789ccfb	Revert #156868 : Bring back symint check for sharding propagation cache (#159671 ) Fixes #159601 Unfortunately #156868 introduced a couple regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671 Approved by: https://github.com/xmfan	2025-08-04 19:58:48 +00:00
PyTorch MergeBot	7e8197e34d	Revert "Migrate ScalarType to headeronly (#159416 )" This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449. Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))	2025-08-04 19:55:09 +00:00
Benjamin Glass	50eac811a6	[typing] Constrain OrderedSet generic to be Hashable (#159684 ) Ran across this typing bug while creating an OrderedSet from a type I didn't realize wasn't hashable, which failed at runtime. With this constraint, typing would've failed pre-runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159684 Approved by: https://github.com/Skylion007	2025-08-04 18:08:01 +00:00
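The idea behind the typing constraint above can be illustrated with a toy container. `TinyOrderedSet` is a hypothetical stand-in, not PyTorch's `OrderedSet`; it only shows how bounding the type variable by `Hashable` lets a static checker report what would otherwise surface as a runtime `TypeError`.

```python
from collections.abc import Hashable
from typing import Generic, Iterable, Iterator, TypeVar

# Bounding T by Hashable lets a type checker reject unhashable element types.
T = TypeVar("T", bound=Hashable)


class TinyOrderedSet(Generic[T]):
    """Toy insertion-ordered set backed by a dict (illustration only)."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        # dict keys must be hashable, and dicts preserve insertion order.
        self._items = dict.fromkeys(items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


ok = TinyOrderedSet([3, 1, 3, 2])
print(list(ok))

# Lists are unhashable: without the bound this only fails at runtime;
# with the bound, a checker such as mypy is expected to flag the call.
try:
    TinyOrderedSet([[1], [2]])
except TypeError as exc:
    print("runtime failure:", exc)
```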
ILCSFNO	4e0f179d0b	Update the signature and test of torch.hamming_window() (#152682 ) Fixes #146590 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152682 Approved by: https://github.com/albanD	2025-08-04 17:50:42 +00:00
Tan Hoang	36e59d9b12	[c10d][nvshmem] fix missing override compilation error for nvshmem symmetric code (#159557 ) Summary: Fix error when compiling nvshmem code section `NVSHMEMSymmetricMemory.cu` with BUCK ``` fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu:154:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override] 154 \| virtual at::Tensor get_buffer(int \| ^ fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here 56 \| virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0; ``` Test Plan: Build test + CI Rollback Plan: Differential Revision: D78813586 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159557 Approved by: https://github.com/kwen2501	2025-08-04 17:46:30 +00:00
angelayi	fc340d0ca3	[export] Allow comparing device w/o index with device w/ index (#159665 ) In the case where we have expected device "cuda" and given device "cuda:0" I think we should succeed? Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665 Approved by: https://github.com/yushangdi	2025-08-04 17:00:07 +00:00
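A rough sketch of the comparison described above, using plain device strings instead of `torch.device` so that it runs without PyTorch. The helper names are hypothetical, and treating a missing index as a wildcard on either side is an assumption; the commit message only mentions expected "cuda" versus given "cuda:0".

```python
from typing import Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'cuda:0' into ('cuda', 0); a bare 'cuda' has index None."""
    kind, sep, index = spec.partition(":")
    return kind, int(index) if sep else None


def devices_compatible(expected: str, given: str) -> bool:
    """Hypothetical check: device types must match; indices only if both are set."""
    exp_kind, exp_idx = parse_device(expected)
    giv_kind, giv_idx = parse_device(given)
    if exp_kind != giv_kind:
        return False
    return exp_idx is None or giv_idx is None or exp_idx == giv_idx


print(devices_compatible("cuda", "cuda:0"))
print(devices_compatible("cuda:0", "cuda:1"))
```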
Animesh Jain	53e47af0f7	[dynamo][guards] Read the attr name from GetAttrGuardAccessor (#159754 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159754 Approved by: https://github.com/jansel ghstack dependencies: #159752	2025-08-04 16:51:27 +00:00
Animesh Jain	66ad881fc7	[dynamo][guards][refactor] Simplify type extraction from GuardManager (#159752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159752 Approved by: https://github.com/jansel	2025-08-04 16:51:27 +00:00
amdfaa	1d3eef27ac	[ROCm CI] Migrate to MI325 Capacity (#159649 ) Migrate mi300s to gfx942. Related to https://github.com/pytorch/pytorch/pull/159059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159649 Approved by: https://github.com/huydhn	2025-08-04 16:48:12 +00:00
Xu Han	dd95900cec	[AOTI] normalize_path_separator file path for Windows. (#159726 ) `normalize_path_separator` file path for Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159726 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-08-04 15:57:19 +00:00
yuchengliu1	1cdd665526	fix test_verbose_logs_dynamic_shapes with MSVC (#159573 ) The `typeid` operator produces different output under different compilers. There is a good example on [cppreference](https://www.en.cppreference.com/w/cpp/language/typeid.html). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159573 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-08-04 15:56:53 +00:00
Tan Hoang	7cb2dcd2dd	[c10d][nvshmem] modify is_nvshmem_available runtime check to work with static-linked library (#159558 ) (#159561 ) Summary: Currently this function relies on the assumption that we link `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When `libnvshmem.a` (which combines the two) is linked statically, this check fails. Add a check for whether a symbol from the host API exists at runtime, to detect that nvshmem is linked statically. Test Plan: CI + sample run Rollback Plan: Differential Revision: D79177525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561 Approved by: https://github.com/kwen2501	2025-08-04 15:40:29 +00:00
Aleksei Nikiforov	e5a81aa7ba	Fix conversion of values in libtorch agnostic tests (#155115 ) Due to different byteorder, when copying data, it has to be put into last bytes to ensure that int32_t converted to int64_t keeps same value. Same has to be done when it's converted back. This change fixes test TestLibtorchAgnosticCPU::test_my_ones_like_cpu from cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115 Approved by: https://github.com/huydhn	2025-08-04 13:40:22 +00:00
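The byte-order issue described above can be reproduced with Python's `struct` module. This is a sketch of the general effect, not the code from the PR: it copies a non-negative 32-bit value into a zeroed 8-byte slot and reads the slot back as a 64-bit integer. Zero padding only preserves non-negative values; negative values would additionally need sign extension.

```python
import struct

value = 7
padding = b"\x00" * 4

# Little-endian (e.g. x86_64): the int32 goes into the FIRST four bytes.
slot_le = struct.pack("<i", value) + padding
le_result = struct.unpack("<q", slot_le)[0]

# Big-endian (e.g. s390x): the same "first four bytes" placement is wrong,
# because those are the most significant bytes of the int64.
slot_be_wrong = struct.pack(">i", value) + padding
be_wrong_result = struct.unpack(">q", slot_be_wrong)[0]

# Placing the int32 into the LAST four bytes keeps the value intact.
slot_be_right = padding + struct.pack(">i", value)
be_right_result = struct.unpack(">q", slot_be_right)[0]

print(le_result, be_wrong_result, be_right_result)
```

The misplaced big-endian copy yields the value shifted left by 32 bits instead of the original value.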
Andrey Talman	3e2aa4b0e3	Update pin to include Python 3.14 support (#159725 ) Update Triton Pin to top of rel/3.4 branch : https://github.com/triton-lang/triton/tree/rel/3.4 . This is the same as release/3.4.x branch but also includes Python 3.14 support This should unblock enablement of Python 3.14 support in this PR: https://github.com/pytorch/pytorch/pull/157559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159725 Approved by: https://github.com/davidberard98	2025-08-04 13:30:12 +00:00
Aleksei Nikiforov	6646461764	S390X: fix detection of magic number placeholder in inductor (#157784 ) This change fixes multiple tests in test/inductor/test_aot_inductor_arrayref.py such as test_cond_with_parameters_cpu_with_stack_allocation, test_issue_140766_cpu_with_stack_allocation, test_model_modified_weights_cpu_with_stack_allocation, test_nested_tensor_from_jagged_cpu_with_stack_allocation. Enable tests in test/inductor/test_aot_inductor_arrayref.py This change is split off from https://github.com/pytorch/pytorch/pull/150116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784 Approved by: https://github.com/huydhn	2025-08-04 12:42:31 +00:00
PyTorch UpdateBot	f74da2a136	[xla hash update] update the pinned xla hash (#159758 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159758 Approved by: https://github.com/pytorchbot	2025-08-04 11:21:45 +00:00
eqy	d35b27dde5	[CUDA] Add some more missing `@serialTest` decorators (#159672 ) Seems to fix #159663 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159672 Approved by: https://github.com/Skylion007	2025-08-04 07:44:35 +00:00
anwang	a9dc1566d4	[MTIA Aten Backend] Migrate arange.start_out (#159540 ) Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540 Approved by: https://github.com/malfet, https://github.com/nautsimon	2025-08-04 07:38:05 +00:00
Jiang, Yanbing	33a1996714	Fix perf downgrade by reverting template use in use_mkldnn_matmul (#159024 ) This PR fixes the performance downgrade by reverting the template use in `use_mkldnn_matmul` in #157520. Fix https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024 Approved by: https://github.com/mingfeima	2025-08-04 05:49:46 +00:00
Animesh Jain	ee62177c19	[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696 Approved by: https://github.com/jansel ghstack dependencies: #159534	2025-08-04 05:12:44 +00:00
Animesh Jain	64cbaa876c	[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534 Approved by: https://github.com/jansel	2025-08-04 05:12:44 +00:00
Animesh Jain	4516c59f5f	[dynamo][source] Add special source for __code__ and __closure__ (#159722 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159722 Approved by: https://github.com/jansel	2025-08-04 05:02:05 +00:00
PyTorch UpdateBot	8bc843a9ec	[vllm hash update] update the pinned vllm hash (#159610 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159610 Approved by: https://github.com/pytorchbot	2025-08-04 04:06:09 +00:00
Jason Ansel	e39a62c70d	Fix warnings in triton_helpers.py (#159719 ) ``` /home/jansel/pytorch/torch/_inductor/runtime/triton_helpers.py:152: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '\|' instead equal \|= a_isnan and b_isnan ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159719 Approved by: https://github.com/Skylion007	2025-08-04 03:21:09 +00:00
Laith Sakka	978e3a9142	refresh expected results (#159727 ) Just a regular update due to recent <10% changes; CI is stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727 Approved by: https://github.com/anijain2305	2025-08-03 22:47:50 +00:00
Nikita Shulga	e2a5c42e7e	[BE][MPS] Build metal kernels of MacOS-14+ (#159733 ) Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary. Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic. Part of https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733 Approved by: https://github.com/dcci ghstack dependencies: #159731, #159732	2025-08-03 20:53:58 +00:00
Nikita Shulga	5116c49b52	[BE] Remove macos-13 guard from bench_mps_ops (#159732 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732 Approved by: https://github.com/dcci ghstack dependencies: #159731	2025-08-03 20:53:58 +00:00
Nikita Shulga	fecdebe385	[CI][MPS] Fix compile benchmark correctness (#159731 ) By passing `fullgraph=True` attribute and increasing cache size limit to 2**16 Otherwise, compiler might decide not to fall back to eager to avoid recompilations Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731 Approved by: https://github.com/dcci	2025-08-03 20:53:50 +00:00
Nikita Shulga	e136a9175b	[BE] Fix dev warning in `Dependencies.cmake` (#159702 ) Namely ``` CMake Warning (dev) in cmake/Dependencies.cmake: A logical block opening on the line /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if) closes on the line /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif) with mis-matching arguments. ``` Introduced by https://github.com/pytorch/pytorch/pull/143846 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-08-03 18:45:07 +00:00
Francisco Massa	9a680e14b7	[bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723 ) The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which was adding a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`. This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723 Approved by: https://github.com/IvanKobzarev	2025-08-03 09:16:55 +00:00
PyTorch MergeBot	805a102beb	Revert "[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 )" This reverts commit 1616777cd2a3170ff76afa3e7860b0969420c445. Reverted https://github.com/pytorch/pytorch/pull/159534 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see `9c18901bfd/1` ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))	2025-08-03 04:58:32 +00:00
PyTorch MergeBot	6e8d705a22	Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 )" This reverts commit be71000ff5292293d1976f313218e2df4d5046d3. Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see `9c18901bfd/1` ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))	2025-08-03 04:58:32 +00:00
anwang	9c18901bfd	[MTIA Aten Backend] Migrate all.out (#159539 ) Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539 Approved by: https://github.com/malfet ghstack dependencies: #159098	2025-08-03 02:08:35 +00:00
Oguz Ulgen	a29ed5e1ac	Add torch compile force disable caches alias (#158072 ) A bunch of people keep thinking the current alias only disables the inductor cache because it has the name inductor in it. Let's globalize the name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072 Approved by: https://github.com/ezyang	2025-08-02 23:23:17 +00:00
Francisco Massa	d2792f51b2	[bucketing] Use max of input/output size for bucketing (#159717 ) The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PR proposes to instead use the max of the input and output sizes, so that one can use the same bucket_size value for both passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717 Approved by: https://github.com/wconstab	2025-08-02 22:42:22 +00:00
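The sizing argument above comes down to a little arithmetic, sketched here with made-up numbers. `bucketing_size` is a hypothetical helper, not the function used by the bucketing pass, and the 8-GPU, 1 MiB shard setup is an assumed example.

```python
def bucketing_size(input_bytes: int, output_bytes: int) -> int:
    """Hypothetical size key: the larger of a collective's input and output."""
    return max(input_bytes, output_bytes)


n_gpu = 8
shard_bytes = 1 << 20             # 1 MiB per-rank shard (assumed example)
full_bytes = n_gpu * shard_bytes  # 8 MiB for the unsharded tensor

# all_gather: small input (shard), large output (full tensor).
all_gather_key = bucketing_size(shard_bytes, full_bytes)

# reduce_scatter: large input (full tensor), small output (shard).
reduce_scatter_key = bucketing_size(full_bytes, shard_bytes)

print(all_gather_key, reduce_scatter_key)
```

Both collectives map to the same size, so one `bucket_size` threshold treats them consistently regardless of `n_gpu`.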
Animesh Jain	be71000ff5	[dynamo] Be consistent with storing func source for UserMethodVariable (#159696 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696 Approved by: https://github.com/jansel ghstack dependencies: #159186, #159534	2025-08-02 21:40:38 +00:00
Aaron Orenstein	3f86076775	gc before warming up benchmarking (#159670 ) #158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670 Approved by: https://github.com/oulgen	2025-08-02 19:37:24 +00:00
Animesh Jain	1616777cd2	[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534 Approved by: https://github.com/jansel ghstack dependencies: #159186	2025-08-02 18:04:35 +00:00
rajeshvshiyal	38895c0ac2	Update RuntimeError message in is_nonzero(input) method from bool to Boolean (#159712 ) RuntimeError message updated in is_nonzero(input) method from bool to Boolean. Case 1: t = torch.tensor([]) torch.is_nonzero(t) Case 2: t = torch.tensor([1,2]) torch.is_nonzero(t) Existing Error message in documentation: for case 1: RuntimeError: bool value of Tensor with no values is ambiguous for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous Proposed Error message in documentation: for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous Fixes #159710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712 Approved by: https://github.com/malfet	2025-08-02 17:23:45 +00:00
Anthony Barbier	310f901a71	Stop parsing command line arguments every time common_utils is imported. (#156703 ) Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs: https://github.com/pytorch/pytorch/pull/154612 https://github.com/pytorch/pytorch/pull/154628 https://github.com/pytorch/pytorch/pull/154715 https://github.com/pytorch/pytorch/pull/154716 https://github.com/pytorch/pytorch/pull/154725 https://github.com/pytorch/pytorch/pull/154728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703 Approved by: https://github.com/clee2000	2025-08-02 16:38:54 +00:00
Nichols A. Romero	e11b1cd97e	[ROCm] fix nightly wheel due to rocBLAS environment variable (#159570 ) Fixes #159070 The TunableOp failure is due to missing rocBLAS files in our manywheels packaging. This bug has been present since June 7-8 time frame. It was caused by a typo in the rocBLAS environment variable that stores the list of files. It was introduced in this PR: https://github.com/pytorch/pytorch/pull/155388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159570 Approved by: https://github.com/malfet	2025-08-02 06:54:43 +00:00
Wenyuan Chi	b599d91738	Log autotune choices and benchmark result to scuba/chrome trace (#159496 ) Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels. Example Event: mm_template_autotuning Column: autotune_choices ```json { "num_choices": 52, "num_triton_choices": 19, "best_kernel": "cutlass_f6c25cf2", "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8", "best_time": 0.6283040046691895, "best_triton_pos": 26, "best_triton_time": 0.6832960247993469, "best_triton_kernel": "triton_mm_17", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0" } ``` Test Plan: ``` TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS =1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt ``` Rollback Plan: Differential Revision: D79235037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496 Approved by: https://github.com/masnesral	2025-08-02 05:34:17 +00:00
Xiao, Wang	fd6a6658c3	Enable _int_mm on Intel GPU (#157769 ) # Motivation This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao. # Model Test Result: We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model config is as follows: Precision : torch.bfloat16 quantization configuration : Int8DynamicActivationInt8WeightConfig dataset : wikitext Result: The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2025-08-02 05:16:01 +00:00
PyTorch UpdateBot	04973496a8	[audio hash update] update the pinned audio hash (#159611 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159611 Approved by: https://github.com/pytorchbot	2025-08-02 05:15:47 +00:00
Sam Larsen	1548b011ea	Fix rand_like decomposition to preserve strides (#159294 ) Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898. Test Plan: New unit test (fails before this PR; but fixed after) Differential Revision: [D79472604](https://our.internmc.facebook.com/intern/diff/D79472604) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294 Approved by: https://github.com/eellison	2025-08-02 03:54:41 +00:00
angelayi	e57a92734d	[export] Fix nn_module_stack of assert_tensor_metadata nodes (#159625 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159625 Approved by: https://github.com/yushangdi	2025-08-02 02:52:42 +00:00
Dylan Maloy	79ff3b320b	Back out "[ez] get rid of unused var" (#159677 ) Summary: turns out i added this to reduce the frequency we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. oops. Test Plan: backout Rollback Plan: Differential Revision: D79474114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677 Approved by: https://github.com/georgiaphillips	2025-08-02 01:50:16 +00:00
nandesuka	426f249f20	Fix launch grid calculation (#159497 ) Summary: The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages. In the FXIR backend we negate this behaviour and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We are not directly changing the original computation, as that leads to a performance degradation. Test Plan: CI Rollback Plan: Differential Revision: D79275534 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497 Approved by: https://github.com/blaine-rister	2025-08-02 01:12:58 +00:00
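The language dependence mentioned above can be shown in a few lines. This sketch assumes the trick in question is the common `-(-n // d)` idiom. `trunc_div` is a hypothetical helper that emulates C-style division (truncation toward zero) to show what happens when the same expression is evaluated in a language without floor semantics.

```python
def ceil_div_via_floor(n: int, d: int) -> int:
    """Ceiling division using Python's floor division on a negated numerator."""
    return -(-n // d)


def trunc_div(n: int, d: int) -> int:
    """Emulate C-style integer division, which truncates toward zero."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d >= 0) else -q


def same_trick_with_truncation(n: int, d: int) -> int:
    """The identical expression, but evaluated with truncating division."""
    return -trunc_div(-n, d)


numel, block = 10, 4
print(ceil_div_via_floor(numel, block))          # floor semantics: rounds up
print(same_trick_with_truncation(numel, block))  # truncation: one block short
```

With 10 elements and a block size of 4, floor semantics give 3 blocks while truncation gives 2, which is why an explicit ceiling-division operation is safer when the expression may be emitted into another language.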
Edward Z. Yang	d33a484763	Use boxed_nop_preserve_node_meta for aot_export_joint_with_descriptors (#159545 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159545 Approved by: https://github.com/xmfan, https://github.com/wconstab ghstack dependencies: #159336, #159337	2025-08-02 00:33:41 +00:00
Natalia Gimelshein	a81ffbc5f5	improve shape checks for grouped_mm (#159666 ) Check that contraction dimension matches between tensors if it's known, and do device-side checks for correct offsets Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666 Approved by: https://github.com/danielvegamyhre, https://github.com/eqy	2025-08-02 00:12:25 +00:00
Huy Do	465fe4d9f7	Enable sample nightly PT2 benchmark on B200 (#158011 ) Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200 ### Testing https://github.com/pytorch/pytorch/actions/runs/16615101382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2025-08-01 23:47:44 +00:00
Natalia Gimelshein	9477af1063	fix compilation on cuda < 12.3 (#159657 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/159657 Approved by: https://github.com/kwen2501	2025-08-01 23:40:55 +00:00
Lucas Kabela	dcc36e38bb	[Graph Breaks] Remove unsupported Additional Info field (#159658 ) Race condition when landing PR#158800 caused us to add this field when it is deprecated, so remove it Pull Request resolved: https://github.com/pytorch/pytorch/pull/159658 Approved by: https://github.com/williamwen42	2025-08-01 23:25:50 +00:00
Zain Rizvi	efd78584a8	[EZ] Add linux-aarch64.yml workflow to the viable/strict blocking set (#159668 ) Since it's required to be run on every PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/159668 Approved by: https://github.com/malfet	2025-08-01 23:19:08 +00:00
Oguz Ulgen	135762ea20	Unpin helion (#159579 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159579 Approved by: https://github.com/jansel	2025-08-01 23:08:06 +00:00
Sherlock Huang	e2ee9cfaa2	[NativeRT] Turn on enableStaticCPUKernels by default (#159422 ) Summary: As title. Test Plan: Need to manual test on production models. Rollback Plan: Differential Revision: D78747742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159422 Approved by: https://github.com/dolpm	2025-08-01 22:27:07 +00:00
Andy Lugo	06d28de17a	Update CK Kernel generation and update ck submodule (#157964 ) changes required to reduce the number of ck kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 to be merged first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964 Approved by: https://github.com/842974287	2025-08-01 22:24:27 +00:00
anwang	df9720b8b5	[MTIA Aten Backend] Migrate all foreach ops (#159098 ) # Context See the first PR https://github.com/pytorch/pytorch/pull/153670 # This diff Migrate all foreach operators to in-tree, including: - _foreach_abs - _foreach_abs_ - _foreach_add.List - _foreach_add_.List - _foreach_add_.Scalar - _foreach_add_.Tensor - _foreach_addcmul.Scalar - _foreach_addcmul_.Scalar - _foreach_copy - _foreach_copy_ - _foreach_mul.List - _foreach_mul_.List - _foreach_mul_.Scalar - _foreach_mul.Tensor - _foreach_mul_.Tensor - _foreach_norm.Scalar - _foreach_sqrt_ Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098 Approved by: https://github.com/malfet	2025-08-01 22:10:12 +00:00
Sandeep Narendranath Karjala	85e74d5ace	[inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190 ) This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs. - Iterates over scheduler.nodes, filters for _CollectiveKernel nodes - Extracts each op’s python_kernel_name - Emits a structured JSON payload under the inductor_collective_schedule artifact name - Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact - Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops). Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190 Approved by: https://github.com/yushangdi, https://github.com/xmfan	2025-08-01 21:51:42 +00:00
Sheng Fu	0450f05658	Output tensor meta data for FX graph node (#159311 ) FX graph segment in CompiledFxGraph does not include tensor meta data, for example, tensor shape, tensor stride, tensor data type, tensor device. AI system co-design team requested to include these information in FX graph segment so they can use FX graph segment to project the performance on different hardware. This DIFF is to modify the Graph::Node::format_node to include tensor meta data. Before this DIFF, the triton kernel FX graph segment looks like the following: ``` # %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm] # %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1] # %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {}) # %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {}) # %cos : cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {}) # return %cos After this DIFF: # %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm] # %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1] # %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {}) # %permute_1 : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {}) # %mul : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {}) # %add : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = 
{}) # %cos : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {}) # return %cos ``` If format_node can not be changed, I can copy the code to caffe2/torch/_inductor/utils.py. Differential Revision: D77973076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159311 Approved by: https://github.com/angelayi	2025-08-01 21:40:29 +00:00
zeshengzong	595a65f5c2	[dynamo] Replace unimplemented with unimplemented_v2 in `torch/_dynamo/variables/script_object.py` (#159343 ) Fixes part of #147913 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159343 Approved by: https://github.com/williamwen42 Co-authored-by: William Wen <william.wen42@gmail.com>	2025-08-01 21:30:41 +00:00
tiandeyu-cs	8c6c2e40eb	Edit a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend (#159542 ) As suggested in the pull request #158903 by @H-huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159542 Approved by: https://github.com/d4l3k, https://github.com/H-Huang	2025-08-01 21:20:25 +00:00
henrylhtsang	32840d19f9	[cutlass backend] skip stream k if shape is dynamic (#159442 ) Differential Revision: [D79229210](https://our.internmc.facebook.com/intern/diff/D79229210/) Motivation is workspace size is hard to determine, and varies for different shape. What I observed is sometimes the shape got smaller, but the workspace can increase. So it is hard to upper bound it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159442 Approved by: https://github.com/ColinPeppler	2025-08-01 20:42:24 +00:00
Xuehai Pan	2040f00112	[BE][Easy] respect `os.environ` in subprocess calls in tools/nightly.py (#159572 ) Respect parent shell's envvars, such as `UV_INDEX_STRATEGY`, `http{,s}_proxy`, etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159572 Approved by: https://github.com/Skylion007	2025-08-01 20:40:31 +00:00
Lucas Kabela	c137f9da0b	[Dynamo][Better Engineering] Add type coverage to dynamo/compiled_autograd.py (#159518 ) As part of the better engineering effort, we would like to improve our type support to improve dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/compiled_autograd.py` Running ``` mypy torch/_dynamo/compiled_autograd.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Annotated \| Lines Total \| % lines covered \| Funcs Annotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 425 \| 1553 \| 27.37% \| 17 \| 62 \| 27.42% \| \| This PR \| 1623 \| 1623 \| 100.00% \| 62 \| 62 \| 100.00% \| \| Delta \| +1198\| +0 \| +72.63% \| +45 \| 0 \| +72.58% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/159518 Approved by: https://github.com/xmfan	2025-08-01 20:24:58 +00:00
Howard Huang	5e8b95605f	[PP] Support OVERLAP_F_B computation type (#158978 ) Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591) The IR looks like: ``` [0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9] [1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9] [2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9] [3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9] ``` In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does. 
`814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978 Approved by: https://github.com/wconstab	2025-08-01 20:22:30 +00:00
Jane Xu	8ea86a6e31	Actually test STD_TORCH_CHECK, add testfile to CMake (#159603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159603 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-08-01 19:53:41 +00:00
PyTorch MergeBot	acad808545	Revert "[inductor] consolidate common GEMM triton param retrieval (#159383 )" This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443. Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))	2025-08-01 19:49:21 +00:00
PyTorch MergeBot	c687446374	Revert "Fix rand_like decomposition to preserve strides (#159294 )" This reverts commit 2c46922ce4b33c39b1c48c302604805510a3f889. Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to breaking internal test ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3145541845))	2025-08-01 19:19:51 +00:00
Will Constable	dd22ba09b4	[C10D] Document barrier interaction with device_id (#159389 ) Addresses #159262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389 Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj	2025-08-01 18:12:21 +00:00
Yu, Guangye	c0e0126399	Remove unused input parameter in ExpandableSegment (#159356 ) # Motivation While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion. # Additional Context I noticed that `ExpandableSegment` is defined in cpp file, so it should be safe to make this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356 Approved by: https://github.com/ngimel, https://github.com/albanD ghstack dependencies: #159159	2025-08-01 17:47:51 +00:00
Ivan Zaitsev	e4b123b5e4	Revert direct updates (#159654 ) reverts: ``` commit 5711a8f06948eeee56ed5f53f171fa519f78491c (tag: trunk/5711a8f06948eeee56ed5f53f171fa519f78491c, origin/main, main) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:32:52 2025 -0700 Update test_utils.py commit b4b71d011ed07a41c2086ff0dec2988a63662877 (tag: trunk/b4b71d011ed07a41c2086ff0dec2988a63662877) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:27:54 2025 -0700 Update utils.py commit 52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d (tag: trunk/52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d) Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com> Date: Fri Aug 1 09:26:05 2025 -0700 ``` (commits pushed directly to main by mistake) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159654 Approved by: https://github.com/atalman	2025-08-01 16:54:51 +00:00
Jovian Anthony Jaison	5711a8f069	Update test_utils.py	2025-08-01 09:32:52 -07:00
Jovian Anthony Jaison	b4b71d011e	Update utils.py	2025-08-01 09:27:54 -07:00
Jovian Anthony Jaison	52376b9b6f	Update convert_frame.py	2025-08-01 09:26:05 -07:00
Jane Xu	1371a98b0e	Migrate ScalarType to headeronly (#159416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416 Approved by: https://github.com/albanD ghstack dependencies: #159415, #159411	2025-08-01 16:07:01 +00:00
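
The launch grid fix in the log above (#159497) depends on a detail of integer division: `-(-n // d)` equals the ceiling of `n / d` only when division rounds toward negative infinity, as Python's `//` does. The sketch below illustrates that point and is not the Inductor code; every function name in it is hypothetical, and `trunc_div` merely emulates the truncating division found in C-like languages.

```python
def ceil_div_via_floor(n: int, d: int) -> int:
    """Ceiling division using the negated floor-division trick (Python semantics)."""
    return -(-n // d)


def trunc_div(n: int, d: int) -> int:
    """Integer division that truncates toward zero, as in C or C++."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d >= 0) else -q


def ceil_div_via_trunc(n: int, d: int) -> int:
    """The same trick evaluated with truncating division; off by one for inexact quotients."""
    return -trunc_div(-n, d)


def ceil_div_explicit(n: int, d: int) -> int:
    """A form that does not depend on rounding direction, for n >= 0 and d > 0."""
    return (n + d - 1) // d


if __name__ == "__main__":
    for n, d in [(10, 4), (8, 4), (1, 128)]:
        print(n, d, ceil_div_via_floor(n, d), ceil_div_via_trunc(n, d), ceil_div_explicit(n, d))
```

With truncating division the trick silently degrades to plain truncation whenever the quotient is inexact, which is presumably why a backend that emits code for other languages would want an explicit CeilDiv() node instead.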

2330 changed files with 123172 additions and 93950 deletions

									
										15

.bc-linter.yml
									
										Normal file
									
												
				@ -0,0 +1,15 @@

				version: 1

				paths:

				include:

				  - "**/*.py"

				exclude:

				  - ".*"

				  - ".*/**"

				  - "**/.*/**"

				  - "**/.*"

				  - "**/_*/**"

				  - "**/_*.py"

				  - "**/test/**"

				  - "**/benchmarks/**"

				  - "**/test_*.py"

				  - "**/*_test.py"
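
To see which paths a configuration like the one above would select, the include and exclude lists can be evaluated with a small glob matcher. This is a sketch under stated assumptions, not the linter's implementation: it assumes `**/` matches zero or more directories, `*` stays within one path component, and a path is linted when it matches an include pattern and no exclude pattern. The helper names `glob_to_regex` and `is_linted` are hypothetical.

```python
import re

INCLUDE = ["**/*.py"]
EXCLUDE = [
    ".*", ".*/**", "**/.*/**", "**/.*",
    "**/_*/**", "**/_*.py",
    "**/test/**", "**/benchmarks/**",
    "**/test_*.py", "**/*_test.py",
]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob with `*` and `**` into an anchored regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything within one path component
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def is_linted(path: str) -> bool:
    """True when the path matches an include pattern and no exclude pattern."""
    included = any(glob_to_regex(p).match(path) for p in INCLUDE)
    excluded = any(glob_to_regex(p).match(path) for p in EXCLUDE)
    return included and not excluded


if __name__ == "__main__":
    for path in ["torch/nn/functional.py", "torch/_dynamo/utils.py", "test/test_torch.py"]:
        print(path, is_linted(path))
```

Under these assumptions only public, non-test Python modules remain: underscore-prefixed modules and packages, hidden directories, tests, and benchmarks all fall out through the exclude list.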

									
										26

.ci/aarch64_linux/aarch64_ci_build.sh
									
												
				@ -3,8 +3,20 @@ set -eux -o pipefail

				GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

				if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then

				# Set CUDA architecture lists to match x86 build_cuda.sh

				if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="8.0;9.0"

				elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"

				elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then

				    export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"

				fi

				# Compress the fatbin with -compress-mode=size for CUDA 13

				if [[ "$DESIRED_CUDA" == *"13"* ]]; then

				    export TORCH_NVCC_FLAGS="-compress-mode=size"

				    # Bundle ptxas into the cu13 wheel, see https://github.com/pytorch/pytorch/issues/163801

				    export BUILD_BUNDLE_PTXAS=1

				fi

				SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

				@ -18,7 +30,7 @@ cd /

				# on the mounted pytorch repo

				git config --global --add safe.directory /pytorch

				pip install -r /pytorch/requirements.txt

				pip install auditwheel==6.2.0

				pip install auditwheel==6.2.0 wheel

				if [ "$DESIRED_CUDA" = "cpu" ]; then

				    echo "BASE_CUDA_VERSION is not set. Building cpu wheel."

				    #USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

				@ -26,6 +38,16 @@ if [ "$DESIRED_CUDA" = "cpu" ]; then

				else

				    echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"

				    export USE_SYSTEM_NCCL=1

				    # Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)

				    if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then

				        echo "Bundling CUDA libraries with wheel for aarch64."

				    else

				        echo "Using nvidia libs from pypi for aarch64."

				        echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"

				        export USE_NVIDIA_PYPI_LIBS=1

				    fi

				    #USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

				    USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda

				fi
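
The architecture selection in the script above is a chain of substring tests on `GPU_ARCH_VERSION`. A minimal Python sketch of the same mapping follows, assuming the 12.6, 12.8, and 13.0 branches shown in the diff are the ones in effect; `cuda_arch_list` and `ARCH_LISTS` are hypothetical names, and the example inputs are illustrative.

```python
from typing import Optional

# (substring of GPU_ARCH_VERSION, TORCH_CUDA_ARCH_LIST value), checked in order.
ARCH_LISTS = [
    ("12.6", "8.0;9.0"),
    ("12.8", "8.0;9.0;10.0;12.0"),
    ("13.0", "8.0;9.0;10.0;11.0;12.0+PTX"),
]


def cuda_arch_list(gpu_arch_version: str) -> Optional[str]:
    """Return the arch list for the first matching substring, or None if nothing matches."""
    for needle, arch_list in ARCH_LISTS:
        if needle in gpu_arch_version:
            return arch_list
    return None


if __name__ == "__main__":
    for version in ["12.6", "12.8", "13.0", ""]:
        print(repr(version), cuda_arch_list(version))
```

As in the shell version, an unmatched or empty value leaves the arch list unset rather than raising an error.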

									
										246

.ci/aarch64_linux/aarch64_wheel_ci_build.py
									
												
				@ -69,61 +69,186 @@ def replace_tag(filename) -> None:

				        f.writelines(lines)

				def patch_library_rpath(

				    folder: str,

				    lib_name: str,

				    use_nvidia_pypi_libs: bool = False,

				    desired_cuda: str = "",

				) -> None:

				    """Apply patchelf to set RPATH for a library in torch/lib"""

				    lib_path = f"{folder}/tmp/torch/lib/{lib_name}"

				    if use_nvidia_pypi_libs:

				        # For PyPI NVIDIA libraries, construct CUDA RPATH

				        cuda_rpaths = [

				            "$ORIGIN/../../nvidia/cudnn/lib",

				            "$ORIGIN/../../nvidia/nvshmem/lib",

				            "$ORIGIN/../../nvidia/nccl/lib",

				            "$ORIGIN/../../nvidia/cusparselt/lib",

				        ]

				        if "130" in desired_cuda:

				            cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")

				        else:

				            cuda_rpaths.extend(

				                [

				                    "$ORIGIN/../../nvidia/cublas/lib",

				                    "$ORIGIN/../../nvidia/cuda_cupti/lib",

				                    "$ORIGIN/../../nvidia/cuda_nvrtc/lib",

				                    "$ORIGIN/../../nvidia/cuda_runtime/lib",

				                    "$ORIGIN/../../nvidia/cufft/lib",

				                    "$ORIGIN/../../nvidia/curand/lib",

				                    "$ORIGIN/../../nvidia/cusolver/lib",

				                    "$ORIGIN/../../nvidia/cusparse/lib",

				                    "$ORIGIN/../../nvidia/nvtx/lib",

				                    "$ORIGIN/../../nvidia/cufile/lib",

				                ]

				            )

				        # Add $ORIGIN for local torch libs

				        rpath = ":".join(cuda_rpaths) + ":$ORIGIN"

				    else:

				        # For bundled libraries, just use $ORIGIN

				        rpath = "$ORIGIN"

				    if os.path.exists(lib_path):

				        os.system(

				            f"cd {folder}/tmp/torch/lib/; "

				            f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"

				        )

				def copy_and_patch_library(

				    src_path: str,

				    folder: str,

				    use_nvidia_pypi_libs: bool = False,

				    desired_cuda: str = "",

				) -> None:

				    """Copy a library to torch/lib and patch its RPATH"""

				    if os.path.exists(src_path):

				        lib_name = os.path.basename(src_path)

				        shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")

				        patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)

				def package_cuda_wheel(wheel_path, desired_cuda) -> None:

				    """

				    Package the cuda wheel libraries

				    """

				    folder = os.path.dirname(wheel_path)

				    wheelname = os.path.basename(wheel_path)

				    os.mkdir(f"{folder}/tmp")

				    os.system(f"unzip {wheel_path} -d {folder}/tmp")

				    libs_to_copy = [

				        "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",

				        "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",

				        "/usr/local/cuda/lib64/libcudnn.so.9",

				        "/usr/local/cuda/lib64/libcublas.so.12",

				        "/usr/local/cuda/lib64/libcublasLt.so.12",

				        "/usr/local/cuda/lib64/libcudart.so.12",

				        "/usr/local/cuda/lib64/libcufft.so.11",

				        "/usr/local/cuda/lib64/libcusparse.so.12",

				        "/usr/local/cuda/lib64/libcusparseLt.so.0",

				        "/usr/local/cuda/lib64/libcusolver.so.11",

				        "/usr/local/cuda/lib64/libcurand.so.10",

				        "/usr/local/cuda/lib64/libnccl.so.2",

				        "/usr/local/cuda/lib64/libnvJitLink.so.12",

				        "/usr/local/cuda/lib64/libnvrtc.so.12",

				        "/usr/local/cuda/lib64/libcudnn_adv.so.9",

				        "/usr/local/cuda/lib64/libcudnn_cnn.so.9",

				        "/usr/local/cuda/lib64/libcudnn_graph.so.9",

				        "/usr/local/cuda/lib64/libcudnn_ops.so.9",

				        "/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",

				        "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",

				        "/usr/local/cuda/lib64/libcudnn_heuristic.so.9",

				        "/lib64/libgomp.so.1",

				        "/usr/lib64/libgfortran.so.5",

				        "/acl/build/libarm_compute.so",

				        "/acl/build/libarm_compute_graph.so",

				        "/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",

				        "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",

				        "/usr/local/lib/libnvpl_lapack_core.so.0",

				        "/usr/local/lib/libnvpl_blas_core.so.0",

				    ]

				    # Delete original wheel since it will be repackaged

				    os.system(f"rm {wheel_path}")

				    if "129" in desired_cuda:

				        libs_to_copy += [

				            "/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",

				            "/usr/local/cuda/lib64/libcufile.so.0",

				            "/usr/local/cuda/lib64/libcufile_rdma.so.1",

				    # Check if we should use PyPI NVIDIA libraries or bundle system libraries

				    use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"

				    if use_nvidia_pypi_libs:

				        print("Using nvidia libs from pypi - skipping CUDA library bundling")

				        # For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages

				        # We only need to bundle non-NVIDIA libraries

				        minimal_libs_to_copy = [

				            "/lib64/libgomp.so.1",

				            "/usr/lib64/libgfortran.so.5",

				            "/acl/build/libarm_compute.so",

				            "/acl/build/libarm_compute_graph.so",

				            "/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",

				            "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",

				            "/usr/local/lib/libnvpl_lapack_core.so.0",

				            "/usr/local/lib/libnvpl_blas_core.so.0",

				        ]

				    # Copy libraries to unzipped_folder/a/lib

				    for lib_path in libs_to_copy:

				        lib_name = os.path.basename(lib_path)

				        shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")

				        os.system(

				            f"cd {folder}/tmp/torch/lib/; "

				            f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"

				        )

				        # Copy minimal libraries to unzipped_folder/torch/lib

				        for lib_path in minimal_libs_to_copy:

				            copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

				        # Patch torch libraries used for searching libraries

				        torch_libs_to_patch = [

				            "libtorch.so",

				            "libtorch_cpu.so",

				            "libtorch_cuda.so",

				            "libtorch_cuda_linalg.so",

				            "libtorch_global_deps.so",

				            "libtorch_python.so",

				            "libtorch_nvshmem.so",

				            "libc10.so",

				            "libc10_cuda.so",

				            "libcaffe2_nvrtc.so",

				            "libshm.so",

				        ]

				        for lib_name in torch_libs_to_patch:

				            patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)

				    else:

				        print("Bundling CUDA libraries with wheel")

				        # Original logic for bundling system CUDA libraries

				        # Common libraries for all CUDA versions

				        common_libs = [

				            # Non-NVIDIA system libraries

				            "/lib64/libgomp.so.1",

				            "/usr/lib64/libgfortran.so.5",

				            "/acl/build/libarm_compute.so",

				            "/acl/build/libarm_compute_graph.so",

				            # Common CUDA libraries (same for all versions)

				            "/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",

				            "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",

				            "/usr/local/lib/libnvpl_lapack_core.so.0",

				            "/usr/local/lib/libnvpl_blas_core.so.0",

				            "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",

				            "/usr/local/cuda/lib64/libcudnn.so.9",

				            "/usr/local/cuda/lib64/libcusparseLt.so.0",

				            "/usr/local/cuda/lib64/libcurand.so.10",

				            "/usr/local/cuda/lib64/libnccl.so.2",

				            "/usr/local/cuda/lib64/libnvshmem_host.so.3",

				            "/usr/local/cuda/lib64/libcudnn_adv.so.9",

				            "/usr/local/cuda/lib64/libcudnn_cnn.so.9",

				            "/usr/local/cuda/lib64/libcudnn_graph.so.9",

				            "/usr/local/cuda/lib64/libcudnn_ops.so.9",

				            "/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",

				            "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",

				            "/usr/local/cuda/lib64/libcudnn_heuristic.so.9",

				            "/usr/local/cuda/lib64/libcufile.so.0",

				            "/usr/local/cuda/lib64/libcufile_rdma.so.1",

				            "/usr/local/cuda/lib64/libcusparse.so.12",

				        ]

				        # CUDA version-specific libraries

				        if "13" in desired_cuda:

				            minor_version = desired_cuda[-1]

				            version_specific_libs = [

				                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",

				                "/usr/local/cuda/lib64/libcublas.so.13",

				                "/usr/local/cuda/lib64/libcublasLt.so.13",

				                "/usr/local/cuda/lib64/libcudart.so.13",

				                "/usr/local/cuda/lib64/libcufft.so.12",

				                "/usr/local/cuda/lib64/libcusolver.so.12",

				                "/usr/local/cuda/lib64/libnvJitLink.so.13",

				                "/usr/local/cuda/lib64/libnvrtc.so.13",

				                f"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.{minor_version}",

				            ]

				        elif "12" in desired_cuda:

				            # Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")

				            minor_version = desired_cuda[-1]

				            version_specific_libs = [

				                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",

				                "/usr/local/cuda/lib64/libcublas.so.12",

				                "/usr/local/cuda/lib64/libcublasLt.so.12",

				                "/usr/local/cuda/lib64/libcudart.so.12",

				                "/usr/local/cuda/lib64/libcufft.so.11",

				                "/usr/local/cuda/lib64/libcusolver.so.11",

				                "/usr/local/cuda/lib64/libnvJitLink.so.12",

				                "/usr/local/cuda/lib64/libnvrtc.so.12",

				                f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",

				            ]

				        else:

				            raise ValueError(f"Unsupported CUDA version: {desired_cuda}.")

				        # Combine all libraries

				        libs_to_copy = common_libs + version_specific_libs

				        # Copy libraries to unzipped_folder/torch/lib

				        for lib_path in libs_to_copy:

				            copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

				    # Make sure the wheel is tagged with manylinux_2_28

				    for f in os.scandir(f"{folder}/tmp/"):

				@ -131,14 +256,8 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:

				            replace_tag(f"{f.path}/WHEEL")

				            break

				    os.mkdir(f"{folder}/cuda_wheel")

				    os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")

				    shutil.move(

				        f"{folder}/cuda_wheel/{wheelname}",

				        f"{folder}/{wheelname}",

				        copy_function=shutil.copy2,

				    )

				    os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")

				    os.system(f"wheel pack {folder}/tmp/ -d {folder}")

				    os.system(f"rm -rf {folder}/tmp/")

				def complete_wheel(folder: str) -> str:

				@ -161,14 +280,7 @@ def complete_wheel(folder: str) -> str:

				            f"/{folder}/dist/{repaired_wheel_name}",

				        )

				    else:

				        repaired_wheel_name = wheel_name.replace(

				            "linux_aarch64", "manylinux_2_28_aarch64"

				        )

				        print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")

				        os.rename(

				            f"/{folder}/dist/{wheel_name}",

				            f"/{folder}/dist/{repaired_wheel_name}",

				        )

				        repaired_wheel_name = list_dir(f"/{folder}/dist")[0]

				    print(f"Copying {repaired_wheel_name} to artifacts")

				    shutil.copy2(

				@ -208,7 +320,17 @@ if __name__ == "__main__":

				    build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "

				    # MAX_JOB=5 is not required for CPU backend (see commit 465d98b)

				    if enable_cuda:

				        build_vars = "MAX_JOBS=5 " + build_vars

				        build_vars += "MAX_JOBS=5 "

				        # Handle PyPI NVIDIA libraries vs bundled libraries

				        use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"

				        if use_nvidia_pypi_libs:

				            print("Configuring build for PyPI NVIDIA libraries")

				            # Configure for dynamic linking (matching x86 logic)

				            build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "

				        else:

				            print("Configuring build for bundled NVIDIA libraries")

				            # Keep existing static linking approach - already configured above

				    override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")

				    desired_cuda = os.getenv("DESIRED_CUDA")
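
The RPATH logic in `patch_library_rpath` above can be isolated as a pure function, which makes the two packaging modes easier to compare. The sketch below reproduces only the string construction from the diff and does not call `patchelf`; `build_rpath` is a hypothetical name, and the `desired_cuda` values used in the example (`"cu130"`, `"cu128"`) are assumed formats.

```python
def build_rpath(use_nvidia_pypi_libs: bool, desired_cuda: str = "") -> str:
    """Build the colon-separated RPATH string for a library in torch/lib."""
    if not use_nvidia_pypi_libs:
        # Bundled libraries sit next to the torch libraries.
        return "$ORIGIN"

    packages = ["cudnn", "nvshmem", "nccl", "cusparselt"]
    if "130" in desired_cuda:
        # The diff uses a single cu13 directory for this case.
        packages.append("cu13")
    else:
        packages.extend(
            [
                "cublas", "cuda_cupti", "cuda_nvrtc", "cuda_runtime", "cufft",
                "curand", "cusolver", "cusparse", "nvtx", "cufile",
            ]
        )
    entries = [f"$ORIGIN/../../nvidia/{name}/lib" for name in packages]
    # $ORIGIN goes last so local torch libraries are still found.
    return ":".join(entries + ["$ORIGIN"])


if __name__ == "__main__":
    print(build_rpath(False))
    print(build_rpath(True, "cu130"))
    print(len(build_rpath(True, "cu128").split(":")))
```

Because `$ORIGIN` expands to the directory of the library being loaded, every entry is relative to `torch/lib`, so the wheel stays relocatable across install locations.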

									
										16

.ci/aarch64_linux/build_aarch64_wheel.py
									
												
				@ -438,9 +438,7 @@ def build_torchvision(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -495,9 +493,7 @@ def build_torchdata(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -553,9 +549,7 @@ def build_torchtext(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

				@ -613,9 +607,7 @@ def build_torchaudio(

				        )

				        build_vars += f"BUILD_VERSION={version}.dev{build_date}"

				    elif build_version is not None:

				        build_vars += (

				            f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"

				        )

				        build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"

				    if host.using_docker():

				        build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
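Note on the four hunks above: both spellings of the split take the text before the first `-`, so the derived `PYTORCH_VERSION` should not change; `maxsplit=1` only stops splitting after the first separator. A small sketch, with hypothetical function names and illustrative branch strings:

```python
def pytorch_version_from_branch(branch: str) -> str:
    # Drop the leading character (for example "v"), keep the text before the first "-".
    return branch[1:].split("-", maxsplit=1)[0]


def pytorch_version_from_branch_old(branch: str) -> str:
    # Pre-change spelling: splits on every "-" but still takes element 0.
    return branch[1:].split("-")[0]


for name in ("v2.9.0", "v2.9.0-rc5", "v2.9.0-rc5-extra"):
    print(name, pytorch_version_from_branch(name), pytorch_version_from_branch_old(name))
```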

									
										4

.ci/docker/README.md
									
												
				@ -120,8 +120,8 @@ If your new Docker image needs a library installed from a specific pinned commit

				   If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:

				   ```bash

				   docker build \

				      ....

				      --build-arg "NEW_ARG_1=${NEW_ARG_1}"

				     ....

				     --build-arg "NEW_ARG_1=${NEW_ARG_1}"

				   ```

				3. **Update Dockerfile logic**:

									
										6

.ci/docker/almalinux/Dockerfile
									
												
				@ -64,6 +64,10 @@ FROM cuda as cuda12.9

				RUN bash ./install_cuda.sh 12.9

				ENV DESIRED_CUDA=12.9

				FROM cuda as cuda13.0

				RUN bash ./install_cuda.sh 13.0

				ENV DESIRED_CUDA=13.0

				FROM ${ROCM_IMAGE} as rocm

				ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"

				ADD ./common/install_mkl.sh install_mkl.sh

				@ -76,10 +80,10 @@ ADD ./common/install_mnist.sh install_mnist.sh

				RUN bash ./install_mnist.sh

				FROM base as all_cuda

				COPY --from=cuda11.8  /usr/local/cuda-11.8 /usr/local/cuda-11.8

				COPY --from=cuda12.6  /usr/local/cuda-12.6 /usr/local/cuda-12.6

				COPY --from=cuda12.8  /usr/local/cuda-12.8 /usr/local/cuda-12.8

				COPY --from=cuda12.9  /usr/local/cuda-12.9 /usr/local/cuda-12.9

				COPY --from=cuda13.0  /usr/local/cuda-13.0 /usr/local/cuda-13.0

				# Final step

				FROM ${BASE_TARGET} as final

									
										162

.ci/docker/build.sh
									
												
				@ -76,10 +76,13 @@ elif [[ "$image" == *cuda*linter* ]]; then

				elif [[ "$image" == *linter* ]]; then

				  # Use a separate Dockerfile for linter to keep a small image size

				  DOCKERFILE="linter/Dockerfile"

				elif [[ "$image" == *riscv* ]]; then

				  # Use RISC-V specific Dockerfile

				  DOCKERFILE="ubuntu-cross-riscv/Dockerfile"

				fi

				_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb

				_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				_UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152

				_UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96

				if [[ "$image" == *rocm* ]]; then

				  _UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6

				  _UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d

				@ -111,6 +114,16 @@ case "$tag" in

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11)

				    CUDA_VERSION=13.0.0

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    ANACONDA_PYTHON_VERSION=3.10

				@ -122,38 +135,6 @@ case "$tag" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.8.1

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.6.3

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)

				    CUDA_VERSION=12.8.1

				    ANACONDA_PYTHON_VERSION=3.12

				@ -164,39 +145,6 @@ case "$tag" in

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.6

				    ANACONDA_PYTHON_VERSION=3.13

				    GCC_VERSION=9

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.8.1

				    ANACONDA_PYTHON_VERSION=3.10

				@ -208,30 +156,18 @@ case "$tag" in

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3-clang12-onnx)

				    ANACONDA_PYTHON_VERSION=3.9

				    ANACONDA_PYTHON_VERSION=3.10

				    CLANG_VERSION=12

				    VISION=yes

				    ONNX=yes

				    ;;

				  pytorch-linux-jammy-py3.9-clang12)

				    ANACONDA_PYTHON_VERSION=3.9

				  pytorch-linux-jammy-py3.10-clang12)

				    ANACONDA_PYTHON_VERSION=3.10

				    CLANG_VERSION=12

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.11-clang12)

				    ANACONDA_PYTHON_VERSION=3.11

				    CLANG_VERSION=12

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.9-gcc9)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=9

				    VISION=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)

				  pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)

				    if [[ $tag =~ "jammy" ]]; then

				      ANACONDA_PYTHON_VERSION=3.10

				    else

				@ -245,7 +181,9 @@ case "$tag" in

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    if [[ $tag =~ "benchmarks" ]]; then

				      INDUCTOR_BENCHMARKS=yes

				    fi

				    ;;

				  pytorch-linux-noble-rocm-alpha-py3)

				    ANACONDA_PYTHON_VERSION=3.12

				@ -257,27 +195,26 @@ case "$tag" in

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    INDUCTOR_BENCHMARKS=yes

				    PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"

				    ;;

				  pytorch-linux-jammy-xpu-2025.0-py3)

				    ANACONDA_PYTHON_VERSION=3.9

				    GCC_VERSION=11

				    VISION=yes

				    XPU_VERSION=2025.0

				    NINJA_VERSION=1.9.0

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-xpu-2025.1-py3)

				    ANACONDA_PYTHON_VERSION=3.9

				  pytorch-linux-jammy-xpu-n-1-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				    XPU_VERSION=2025.1

				    NINJA_VERSION=1.9.0

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)

				    ANACONDA_PYTHON_VERSION=3.9

				  pytorch-linux-jammy-xpu-n-py3)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				    XPU_VERSION=2025.2

				    NINJA_VERSION=1.9.0

				    TRITON=yes

				    ;;

				  pytorch-linux-jammy-py3-gcc11-inductor-benchmarks)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				    KATEX=yes

				@ -285,8 +222,8 @@ case "$tag" in

				    DOCS=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)

				    ANACONDA_PYTHON_VERSION=3.9

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-clang12)

				    ANACONDA_PYTHON_VERSION=3.10

				    CUDA_VERSION=12.8.1

				    CLANG_VERSION=12

				    VISION=yes

				@ -297,8 +234,8 @@ case "$tag" in

				    CLANG_VERSION=18

				    VISION=yes

				    ;;

				  pytorch-linux-jammy-py3.9-gcc11)

				    ANACONDA_PYTHON_VERSION=3.9

				  pytorch-linux-jammy-py3.10-gcc11)

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=11

				    VISION=yes

				    KATEX=yes

				@ -325,13 +262,10 @@ case "$tag" in

				    TRITON_CPU=yes

				    ;;

				  pytorch-linux-jammy-linter)

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

				    # We will need to update mypy version eventually, but that's for another day. The task

				    # would be to upgrade mypy to 1.0.0 with Python 3.11

				    PYTHON_VERSION=3.9

				    PYTHON_VERSION=3.10

				    ;;

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)

				    PYTHON_VERSION=3.9

				  pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter)

				    PYTHON_VERSION=3.10

				    CUDA_VERSION=12.8.1

				    ;;

				  pytorch-linux-jammy-aarch64-py3.10-gcc11)

				@ -339,7 +273,6 @@ case "$tag" in

				    GCC_VERSION=11

				    ACL=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    OPENBLAS=yes

				    # snadampal: skipping llvm src build install because the current version

				    # from pytorch/llvm:9.0.1 is x86 specific

				@ -350,13 +283,15 @@ case "$tag" in

				    GCC_VERSION=11

				    ACL=yes

				    VISION=yes

				    CONDA_CMAKE=yes

				    OPENBLAS=yes

				    # snadampal: skipping llvm src build install because the current version

				    # from pytorch/llvm:9.0.1 is x86 specific

				    SKIP_LLVM_SRC_BUILD_INSTALL=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-noble-riscv64-py3.12-gcc14)

				    GCC_VERSION=14

				    ;;

				  *)

				    # Catch-all for builds that are not hardcoded.

				    VISION=yes

				@ -477,7 +412,14 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then

				fi

				if [ -n "$GCC_VERSION" ]; then

				  if !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then

				  if [[ "$image" == *riscv* ]]; then

				    # Check RISC-V cross-compilation toolchain version

				    if !(drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version 2>&1 | grep -q " $GCC_VERSION\\W"); then

				      echo "RISC-V GCC_VERSION=$GCC_VERSION, but:"

				      drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version

				      exit 1

				    fi

				  elif !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then

				    echo "GCC_VERSION=$GCC_VERSION, but:"

				    drun gcc --version

				    exit 1
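Note on the GCC check above: the `grep -q " $GCC_VERSION\\W"` test looks for the version number preceded by a space and followed by a non-word character, so `11` is not satisfied by `111` or by a stray `1`. A rough Python equivalent follows; the function name and the sample `--version` strings are illustrative assumptions, and unlike the shell version it escapes the version string before building the pattern.

```python
import re


def gcc_version_matches(version_output: str, gcc_version: str) -> bool:
    """Return True if any line contains ' <gcc_version>' followed by a non-word char."""
    pattern = re.compile(rf" {re.escape(gcc_version)}\W")
    # grep works line by line, so the sketch does too.
    return any(pattern.search(line) for line in version_output.splitlines())


sample = "gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"  # illustrative output only
print(gcc_version_matches(sample, "11"))
print(gcc_version_matches(sample, "14"))
```

For RISC-V images the same pattern is applied to the output of `riscv64-linux-gnu-gcc-${GCC_VERSION} --version` instead of `gcc --version`.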

2

.ci/docker/ci_commit_pins/huggingface-requirements.txt Normal file


 @ -0,0 +1,2 @@
 transformers==4.54.0
 soxr==0.5.0

1

.ci/docker/ci_commit_pins/huggingface.txt


 @ -1 +0,0 @@
 243e186efbf7fb93328dd6b34927a4e8c8f24395

1

.ci/docker/ci_commit_pins/nccl-cu13.txt Normal file


 @ -0,0 +1 @@
 v2.27.7-1

1

.ci/docker/ci_commit_pins/torchbench.txt Normal file


 @ -0,0 +1 @@
 74a23feff57432129df84d8099e622773cf77925

2

.ci/docker/ci_commit_pins/triton-xpu.txt


 @ -1 +1 @@
 ae324eeac8e102a2b40370e341460f3791353398
 b0418a9a454b2b93ab8d71f40e59d2297157fae

2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 ec6354315768a85da41032535e3b7b99c5f706
 bbb06c0334a6772b92d24bde54956e675c8c6604

									
										9

.ci/docker/common/install_cpython.sh
									
												
				@ -66,8 +66,9 @@ function do_cpython_build {

				        ln -s pip3 ${prefix}/bin/pip

				    fi

				    # install setuptools since python 3.12 is required to use distutils

				    ${prefix}/bin/pip install wheel==0.45.1 setuptools==80.9.0

				    local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")

				    # packaging is needed to create symlink since wheel no longer provides needed information

				    ${prefix}/bin/pip install packaging==25.0 wheel==0.45.1 setuptools==80.9.0

				    local abi_tag=$(${prefix}/bin/python -c "from packaging.tags import interpreter_name, interpreter_version; import sysconfig ; from sysconfig import get_config_var; print('{0}{1}-{0}{1}{2}'.format(interpreter_name(), interpreter_version(), 't' if sysconfig.get_config_var('Py_GIL_DISABLED') else ''))")

				    ln -sf ${prefix} /opt/python/${abi_tag}

				}

				@ -82,9 +83,9 @@ function build_cpython {

				        py_suffix=${py_ver::-1}

				        py_folder=$py_suffix

				    fi

				    # Only b3 is available now

				    # Update to rc2 due to https://github.com/python/cpython/commit/c72699086fe4

				    if [ "$py_suffix" == "3.14.0" ]; then

				        py_suffix="3.14.0b3"

				        py_suffix="3.14.0rc2"

				    fi

				    wget -q $PYTHON_DOWNLOAD_URL/$py_folder/Python-$py_suffix.tgz -O Python-$py_ver.tgz

				    do_cpython_build $py_ver Python-$py_suffix
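Note on the `abi_tag` one-liner above: the format string `'{0}{1}-{0}{1}{2}'` builds the `/opt/python/<tag>` directory name from the interpreter name, the interpreter version, and a `t` suffix for free-threaded builds. The sketch below reproduces only the string formatting with the standard library; the helper name is hypothetical, and the caller passes in the values that the script obtains from `packaging.tags`.

```python
import sysconfig


def abi_dir_name(impl: str, version: str, free_threaded: bool) -> str:
    """Format e.g. ('cp', '313', True) as 'cp313-cp313t'."""
    return "{0}{1}-{0}{1}{2}".format(impl, version, "t" if free_threaded else "")


# Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
gil_disabled = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print(abi_dir_name("cp", "313", gil_disabled))
```

The resulting names line up with the directories that `Dockerfile_2_28` iterates over, such as `cp313-cp313` and `cp313-cp313t`.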

									
										106

.ci/docker/common/install_cuda.sh
									
												
				@ -10,7 +10,7 @@ else

				  arch_path='sbsa'

				fi

				NVSHMEM_VERSION=3.3.9

				NVSHMEM_VERSION=3.3.24

				function install_cuda {

				  version=$1

				@ -62,14 +62,16 @@ function install_nvshmem {

				  mkdir -p "${tmpdir}" && cd "${tmpdir}"

				  # nvSHMEM license: https://docs.nvidia.com/nvshmem/api/sla.html

				  filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"

				  url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"

				  # This pattern is a lie as it is not consistent across versions, for 3.3.9 it was cuda_ver-arch-nvshhem-ver

				  filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"

				  suffix=".tar.xz"

				  url="https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}${suffix}"

				  # download, unpack, install

				  wget -q "${url}"

				  tar xf "${filename}.tar.gz"

				  cp -a "libnvshmem/include/"* /usr/local/include/

				  cp -a "libnvshmem/lib/"*     /usr/local/lib/

				  tar xf "${filename}${suffix}"

				  cp -a "${filename}/include/"* /usr/local/cuda/include/

				  cp -a "${filename}/lib/"*     /usr/local/cuda/lib64/

				  # cleanup

				  cd ..

				@ -126,74 +128,6 @@ function install_129 {

				  ldconfig

				}

				function prune_124 {

				  echo "Pruning CUDA 12.4"

				  #####################################################################################

				  # CUDA 12.4 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then

				      export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.4 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.4/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/

				}

				function prune_126 {

				  echo "Pruning CUDA 12.6"

				  #####################################################################################

				  # CUDA 12.6 prune static libs

				  #####################################################################################

				  export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"

				  export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"

				  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"

				  if [[ -n "$OVERRIDE_GENCODE" ]]; then

				      export GENCODE=$OVERRIDE_GENCODE

				  fi

				  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then

				      export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN

				  fi

				  # all CUDA libs except CuDNN and CuBLAS

				  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis"  \

				      | xargs -I {} bash -c \

				                "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"

				  # prune CuDNN and CuBLAS

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a

				  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a

				  #####################################################################################

				  # CUDA 12.6 prune visual tools

				  #####################################################################################

				  export CUDA_BASE="/usr/local/cuda-12.6/"

				  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/

				}

				function install_128 {

				  CUDNN_VERSION=9.8.0.87

				  echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"

				@ -212,18 +146,38 @@ function install_128 {

				  ldconfig

				}

				function install_130 {

				  CUDNN_VERSION=9.13.0.50

				  echo "Installing CUDA 13.0 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"

				  # install CUDA 13.0 in the same container

				  install_cuda 13.0.0 cuda_13.0.0_580.65.06_linux

				  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				  install_cudnn 13 $CUDNN_VERSION

				  install_nvshmem 13 $NVSHMEM_VERSION

				  CUDA_VERSION=13.0 bash install_nccl.sh

				  CUDA_VERSION=13.0 bash install_cusparselt.sh

				  ldconfig

				}

				# idiomatic parameter and option handling in sh

				while test $# -gt 0

				do

				    case "$1" in

				    12.4) install_124; prune_124

				    12.4) install_124;

				        ;;

				    12.6|12.6.*) install_126; prune_126

				    12.6|12.6.*) install_126;

				        ;;

				    12.8|12.8.*) install_128;

				        ;;

				    12.9|12.9.*) install_129;

				        ;;

				    13.0|13.0.*) install_130;

				        ;;

				    *) echo "bad argument $1"; exit 1

				        ;;

				    esac
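Note on the `case` statement above: after this change each version argument maps to a single install function, with no prune step for 12.4 or 12.6, and 13.0 is accepted. A sketch of the same glob-style dispatch, using `fnmatch` as a stand-in for shell `case` patterns; the table and function name are illustrative, not part of the script.

```python
from fnmatch import fnmatchcase

# (patterns, installer) pairs in the order the case statement lists them.
DISPATCH = [
    (("12.4",), "install_124"),
    (("12.6", "12.6.*"), "install_126"),
    (("12.8", "12.8.*"), "install_128"),
    (("12.9", "12.9.*"), "install_129"),
    (("13.0", "13.0.*"), "install_130"),
]


def select_installer(arg: str) -> str:
    for patterns, installer in DISPATCH:
        if any(fnmatchcase(arg, pattern) for pattern in patterns):
            return installer
    # Mirrors the catch-all branch: echo "bad argument $1"; exit 1
    raise ValueError(f"bad argument {arg}")


print(select_installer("13.0"))
```

Note that `12.4` has no `12.4.*` alternative, so only the exact string is accepted for that version.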

									
										10

.ci/docker/common/install_cusparselt.sh
									
												
				@ -5,7 +5,15 @@ set -ex

				# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html

				mkdir tmp_cusparselt && cd tmp_cusparselt

				if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then

				if [[ ${CUDA_VERSION:0:4} =~ "13" ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

				        arch_path='x86_64'

				    fi

				    CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.8.0.4_cuda13-archive"

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz

				elif [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then

				    arch_path='sbsa'

				    export TARGETARCH=${TARGETARCH:-$(uname -m)}

				    if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
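Note on the new CUDA 13 branch above: the archive name and download URL depend only on the target architecture, which defaults to `uname -m` when `TARGETARCH` is unset. A sketch of that string construction, using the version and URL layout exactly as they appear in the hunk; the function name is hypothetical.

```python
def cusparselt_cuda13_url(target_arch: str) -> str:
    # 'amd64' and 'x86_64' select the x86_64 archive; anything else falls back to sbsa.
    arch_path = "x86_64" if target_arch in ("amd64", "x86_64") else "sbsa"
    name = f"libcusparse_lt-linux-{arch_path}-0.8.0.4_cuda13-archive"
    return (
        "https://developer.download.nvidia.com/compute/cusparselt/redist/"
        f"libcusparse_lt/linux-{arch_path}/{name}.tar.xz"
    )


print(cusparselt_cuda13_url("aarch64"))
```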

									
										31

.ci/docker/common/install_inductor_benchmark_deps.sh
									
												
				@ -5,9 +5,7 @@ set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				function install_huggingface() {

				  local version

				  commit=$(get_pinned_commit huggingface)

				  pip_install "git+https://github.com/huggingface/transformers@${commit}"

				  pip_install -r huggingface-requirements.txt

				}

				function install_timm() {

				@ -15,11 +13,34 @@ function install_timm() {

				  commit=$(get_pinned_commit timm)

				  pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"

				  # Clean up

				  conda_run pip uninstall -y torch torchvision triton

				}

				function install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  python install.py --continue_on_fail

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				  chown -R jenkins torchbench

				  chown -R jenkins /opt/conda

				}

				# Pango is needed for weasyprint which is needed for doctr

				conda_install pango

				# Stable packages are ok here, just to satisfy TorchBench check

				pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

				install_torchbench

				install_huggingface

				install_timm

				# Clean up

				conda_run pip uninstall -y torch torchvision torchaudio triton torchao

									
										2

.ci/docker/common/install_nccl.sh
									
												
				@ -7,6 +7,8 @@ if [[ ${CUDA_VERSION:0:2} == "11" ]]; then

				  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)

				elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)

				elif [[ ${CUDA_VERSION:0:2} == "13" ]]; then

				  NCCL_VERSION=$(cat ci_commit_pins/nccl-cu13.txt)

				else

				  echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"

				  exit 1
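Note on the hunk above: the NCCL pin is chosen from the first two characters of `CUDA_VERSION`, and this change adds the `13` case that reads the new `nccl-cu13.txt` pin. A sketch of the selection logic; the function name is hypothetical.

```python
def nccl_pin_file(cuda_version: str) -> str:
    major = cuda_version[:2]  # same slice as ${CUDA_VERSION:0:2}
    if major in ("11", "12", "13"):
        return f"ci_commit_pins/nccl-cu{major}.txt"
    raise ValueError(f"Unexpected CUDA_VERSION {cuda_version}")


print(nccl_pin_file("13.0"))
```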

									
										4

.ci/docker/common/install_onnx.sh
									
												
				@ -19,8 +19,8 @@ pip_install \

				  transformers==4.36.2

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.18.1

				pip_install onnxscript==0.3.1

				pip_install onnxruntime==1.22.1

				pip_install onnxscript==0.4.0

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

									
										4

.ci/docker/common/install_triton.sh
									
												
				@ -57,7 +57,7 @@ if [ ! -f setup.py ]; then

				  cd python

				fi

				pip_install pybind11==2.13.6

				pip_install pybind11==3.0.1

				# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

				as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py

				@ -103,5 +103,5 @@ fi

				# It depends on torch and triton. We don't want to install

				# triton and torch from production on Docker CI images

				if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then

				  pip_install helion==0.0.10 --no-deps

				  pip_install helion --no-deps

				fi

									
										8

.ci/docker/common/install_ucc.sh
									
												
				@ -44,8 +44,12 @@ function install_ucc() {

				  ./autogen.sh

				  # We only run distributed tests on Tesla M60 and A10G

				  NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"

				  if [[ -n "$CUDA_VERSION"  && $CUDA_VERSION == 13* ]]; then

				    NVCC_GENCODE="-gencode=arch=compute_86,code=compute_86"

				  else

				    # We only run distributed tests on Tesla M60 and A10G

				    NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"

				  fi

				  if [[ -n "$ROCM_VERSION" ]]; then

				    if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then

									
										61

.ci/docker/common/install_xpu.sh
									
												
				@ -34,18 +34,27 @@ function install_ubuntu() {

				    # The xpu-smi packages

				    apt-get install -y flex bison xpu-smi

				    # Compute and Media Runtimes

				    apt-get install -y \

				        intel-opencl-icd intel-level-zero-gpu level-zero \

				        intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \

				        libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				        libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				        mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				    if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				        apt-get install -y intel-ocloc

				    if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then

				        # Compute and Media Runtimes

				        apt-get install -y \

				            intel-opencl-icd intel-level-zero-gpu level-zero \

				            intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \

				            libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				            libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				            mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

				        # Development Packages

				        apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    else # rolling driver

				        apt-get install -y \

				            intel-opencl-icd libze-intel-gpu1 libze1 \

				            intel-media-va-driver-non-free libmfx-gen1 libvpl2 \

				            libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \

				            libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \

				            mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc

				        apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev

				    fi

				    # Development Packages

				    apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev

				    # Install Intel Support Packages

				    apt-get install -y ${XPU_PACKAGES}

				@ -56,10 +65,14 @@ function install_ubuntu() {

				function install_rhel() {

				    . /etc/os-release

				    if [[ ! " 8.8 8.10 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then

				        echo "RHEL version ${VERSION_ID} not supported"

				        exit

				    if [[ "${ID}" == "rhel" ]]; then

				        if [[ ! " 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then

				            echo "RHEL version ${VERSION_ID} not supported"

				            exit

				        fi

				    elif [[ "${ID}" == "almalinux" ]]; then

				        # Workaround for almalinux8 which used by quay.io/pypa/manylinux_2_28_x86_64

				        VERSION_ID="8.8"

				    fi

				    dnf install -y 'dnf-command(config-manager)'

				@ -130,18 +143,18 @@ function install_sles() {

				}

				# Default use GPU driver LTS releases

				XPU_DRIVER_VERSION="/lts/2350"

				if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then

				    # Use GPU driver rolling releases

				    XPU_DRIVER_VERSION=""

				# Default use GPU driver rolling releases

				XPU_DRIVER_VERSION=""

				if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then

				    # Use GPU driver LTS releases

				    XPU_DRIVER_VERSION="/lts/2350"

				fi

				# Default use Intel® oneAPI Deep Learning Essentials 2025.0

				if [[ "$XPU_VERSION" == "2025.1" ]]; then

				    XPU_PACKAGES="intel-deep-learning-essentials-2025.1"

				# Default use Intel® oneAPI Deep Learning Essentials 2025.1

				if [[ "$XPU_VERSION" == "2025.2" ]]; then

				    XPU_PACKAGES="intel-deep-learning-essentials-2025.2"

				else

				    XPU_PACKAGES="intel-deep-learning-essentials-2025.0"

				    XPU_PACKAGES="intel-deep-learning-essentials-2025.1"

				fi
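Note on the last hunk above: the defaults are flipped. Rolling drivers are now the default and LTS is opt-in, and the default oneAPI bundle moves from 2025.0 to 2025.1, with 2025.2 selectable. A sketch of the post-change selection, assuming the later line of each old/new pair is the new one; the function name and return shape are illustrative.

```python
from typing import Optional, Tuple


def xpu_settings(driver_type: Optional[str], xpu_version: Optional[str]) -> Tuple[str, str]:
    """Return (XPU_DRIVER_VERSION, XPU_PACKAGES) as the script would set them."""
    # ${XPU_DRIVER_TYPE,,} lowercases the value before comparing.
    driver_version = "/lts/2350" if (driver_type or "").lower() == "lts" else ""
    if xpu_version == "2025.2":
        packages = "intel-deep-learning-essentials-2025.2"
    else:
        packages = "intel-deep-learning-essentials-2025.1"
    return driver_version, packages


print(xpu_settings("ROLLING", "2025.2"))
```

With `ENV XPU_DRIVER_TYPE ROLLING` and `ENV XPU_VERSION 2025.2`, as set in `Dockerfile_2_28`, this yields an empty driver version and the 2025.2 package.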

				# The installation depends on the base OS

									
										9

.ci/docker/common/patch_libstdc.sh
									
										Executable file
									
												
				@ -0,0 +1,9 @@

				#!/bin/bash

				set -xe

				# Script used in Linux x86 and aarch64 CD pipeline

				# Workaround for exposing statically linked libstdc++ CXX11 ABI symbols.

				# see: https://github.com/pytorch/pytorch/issues/133437

				LIBNONSHARED=$(gcc -print-file-name=libstdc++_nonshared.a)

				nm -g $LIBNONSHARED | grep " T " | grep recursive_directory_iterator | cut -c 20-  > weaken-symbols.txt

				objcopy --weaken-symbols weaken-symbols.txt $LIBNONSHARED $LIBNONSHARED
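Note on the script above: the `nm | grep | cut` pipeline keeps global text symbols (`T`) whose names mention `recursive_directory_iterator`, and `cut -c 20-` drops the 16-digit address, the type letter, and the separating spaces. A sketch of the same filtering over captured `nm -g` output; the function name and the sample lines are illustrative, and the fixed offset assumes 64-bit addresses.

```python
from typing import List


def symbols_to_weaken(nm_output: str) -> List[str]:
    names = []
    for line in nm_output.splitlines():
        if " T " in line and "recursive_directory_iterator" in line:
            # Characters 20 onward (1-based), i.e. index 19 onward, hold the symbol name.
            names.append(line[19:])
    return names


sample = "\n".join(
    [
        "0000000000000000 T _ZNSt10filesystem28recursive_directory_iteratorD1Ev",
        "                 U _ZNSt10filesystem28recursive_directory_iterator3popEv",
        "0000000000000040 T _ZSt9terminatev",
    ]
)
print(symbols_to_weaken(sample))
```

The resulting list is what the script writes to `weaken-symbols.txt` before passing it to `objcopy --weaken-symbols`.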

									
										13

.ci/docker/libtorch/Dockerfile
									
												
				@ -69,6 +69,19 @@ RUN bash ./install_cuda.sh 12.9

				RUN bash ./install_magma.sh 12.9

				RUN ln -sf /usr/local/cuda-12.9 /usr/local/cuda

				FROM cuda as cuda13.0

				RUN bash ./install_cuda.sh 13.0

				RUN bash ./install_magma.sh 13.0

				RUN ln -sf /usr/local/cuda-13.0 /usr/local/cuda

				# Install libibverbs for libtorch and copy to CUDA directory

				RUN apt-get update -y && \

				    apt-get install -y libibverbs-dev librdmacm-dev && \

				    cp /usr/lib/x86_64-linux-gnu/libmlx5.so* /usr/local/cuda/lib64/ && \

				    cp /usr/lib/x86_64-linux-gnu/librdmacm.so* /usr/local/cuda/lib64/ && \

				    cp /usr/lib/x86_64-linux-gnu/libibverbs.so* /usr/local/cuda/lib64/ && \

				    cp /usr/lib/x86_64-linux-gnu/libnl* /usr/local/cuda/lib64/

				FROM cpu as rocm

				ARG ROCM_VERSION

				ARG PYTORCH_ROCM_ARCH

5

.ci/docker/manywheel/Dockerfile_2_28


 @ -130,7 +130,8 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/op
 RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
     /opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
     done;
 ADD ./common/patch_libstdc.sh patch_libstdc.sh
 RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
 # cmake-3.18.4 from pip; force in case cmake3 already exists
 RUN yum install -y python3-pip && \
 @ -175,6 +176,6 @@ ENV XPU_DRIVER_TYPE ROLLING
 RUN python3 -m pip install --upgrade pip && \
     python3 -mpip install cmake==3.28.4
 ADD ./common/install_xpu.sh install_xpu.sh
 ENV XPU_VERSION 2025.1
 ENV XPU_VERSION 2025.2
 RUN bash ./install_xpu.sh && rm install_xpu.sh
 RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

2

.ci/docker/manywheel/Dockerfile_2_28_aarch64


 @ -71,3 +71,5 @@ RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
 RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
 COPY --from=openblas     /opt/OpenBLAS/  /opt/OpenBLAS/
 ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
 ADD ./common/patch_libstdc.sh patch_libstdc.sh
 RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

2

.ci/docker/manywheel/Dockerfile_cuda_aarch64


 @ -95,3 +95,5 @@ COPY --from=nvpl /opt/nvpl/lib/  /usr/local/lib/
 COPY --from=nvpl /opt/nvpl/include/  /usr/local/include/
 RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
 ENV PATH=/usr/local/cuda/bin:$PATH
 ADD ./common/patch_libstdc.sh patch_libstdc.sh
 RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

									
										6

.ci/docker/manywheel/build.sh
									
												
				@ -67,6 +67,12 @@ case ${image} in

				        DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"

				        MANY_LINUX_VERSION="2_28"

				        ;;

				    manylinux2_28-builder:cuda13*)

				        TARGET=cuda_final

				        GPU_IMAGE=amd64/almalinux:8

				        DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"

				        MANY_LINUX_VERSION="2_28"

				        ;;

				    manylinuxaarch64-builder:cuda*)

				        TARGET=cuda_final

				        GPU_IMAGE=amd64/almalinux:8

29

.ci/docker/requirements-ci.txt


 @ -63,11 +63,12 @@ lark==0.12.0
 #Pinned versions: 0.12.0
 #test that import:
 librosa>=0.6.2 ; python_version < "3.11"
 librosa==0.10.2 ; python_version == "3.12"
 librosa>=0.6.2 ; python_version < "3.11" and platform_machine != "s390x"
 librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
 #Description: A python package for music and audio analysis
 #Pinned versions: >=0.6.2
 #test that import: test_spectral_ops.py
 #librosa depends on numba; disable it for s390x while numba is disabled too
 #mkl #this breaks linux-bionic-rocm4.5-py3.7
 #Description: Intel oneAPI Math Kernel Library
 @ -92,8 +93,9 @@ librosa==0.10.2 ; python_version == "3.12"
 #Pinned versions:
 #test that import:
 mypy==1.16.0
 mypy==1.16.0 ; platform_system != "Windows"
 # Pin MyPy version because new errors are likely to appear with each release
 # Skip on Windows as lots of type annotations are POSIX specific
 #Description: linter
 #Pinned versions: 1.16.0
 #test that import: test_typing.py, test_type_hints.py
 @ -110,14 +112,15 @@ ninja==1.11.1.3
 #Pinned versions: 1.11.1.3
 #test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
 numba==0.49.0 ; python_version < "3.9"
 numba==0.55.2 ; python_version == "3.9"
 numba==0.55.2 ; python_version == "3.10"
 numba==0.60.0 ; python_version == "3.12"
 numba==0.49.0 ; python_version < "3.9" and platform_machine != "s390x"
 numba==0.55.2 ; python_version == "3.9" and platform_machine != "s390x"
 numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
 numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
 #Description: Just-In-Time Compiler for Numerical Functions
 #Pinned versions: 0.54.1, 0.49.0, <=0.49.1
 #test that import: test_numba_integration.py
 #For numba issue see https://github.com/pytorch/pytorch/issues/51511
 #Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
 #numpy
 #Description: Provides N-dimensional arrays and linear algebra
 @ -261,11 +264,6 @@ scipy==1.14.1 ; python_version >= "3.12"
 #Pinned versions:
 #test that import:
 tb-nightly==2.13.0a20230426
 #Description: TensorBoard
 #Pinned versions:
 #test that import:
 # needed by torchgen utils
 typing-extensions>=4.10.0
 #Description: type hints for python
 @ -307,7 +305,7 @@ pytest-cpp==2.3.0
 #Pinned versions: 2.3.0
 #test that import:
 z3-solver==4.15.1.0
 z3-solver==4.15.1.0 ; platform_machine != "s390x"
 #Description: The Z3 Theorem Prover Project
 #Pinned versions:
 #test that import:
 @ -342,7 +340,7 @@ onnx==1.18.0
 #Pinned versions:
 #test that import:
 onnxscript==0.3.1
 onnxscript==0.4.0
 #Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
 #Pinned versions:
 #test that import:
 @ -361,7 +359,6 @@ pwlf==2.2.1
 #Pinned versions: 2.2.1
 #test that import: test_sac_estimator.py
 # To build PyTorch itself
 pyyaml
 pyzstd
 @ -383,7 +380,7 @@ dataclasses_json==0.6.7
 cmake==4.0.0
 #Description: required for building
 tlparse==0.3.30
 tlparse==0.4.0
 #Description: required for log parsing
 cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"

2

.ci/docker/requirements-docs.txt


 @ -1,7 +1,7 @@
 sphinx==5.3.0
 #Description: This is used to generate PyTorch docs
 #Pinned versions: 5.3.0
 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git@722b7e6f9ca512fcc526ad07d62b3d28c50bb6cd#egg=pytorch_sphinx_theme2
 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git@71e55749be14ceb56e7f8211a9fb649866b87ad4#egg=pytorch_sphinx_theme2
 # TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
 # but it doesn't seem to work and hangs around idly. The initial thought that it is probably

2

.ci/docker/triton_version.txt


 @ -1 +1 @@
 3.4.0
 3.5.0

2

.ci/docker/triton_xpu_version.txt


 @ -1 +1 @@
 3.4.0
 3.5.0

									
										155

.ci/docker/ubuntu-cross-riscv/Dockerfile
									
										Normal file
									
												
				@ -0,0 +1,155 @@

				# Cross-compilation Docker container for RISC-V architecture

				ARG UBUNTU_VERSION

				FROM --platform=linux/amd64 ubuntu:${UBUNTU_VERSION} as base

				ARG UBUNTU_VERSION

				ENV GCC_VERSION=14

				ENV PYTHON_VERSION=3.12.3

				ENV DEBIAN_FRONTEND=noninteractive

				ENV CC=riscv64-linux-gnu-gcc-${GCC_VERSION}

				ENV CXX=riscv64-linux-gnu-g++-${GCC_VERSION}

				ENV QEMU_LD_PREFIX=/usr/riscv64-linux-gnu/

				ENV SYSROOT=/opt/sysroot

				# Install basic dependencies

				RUN apt-get update && apt-get install -y \

				    ninja-build \

				    autoconf \

				    automake \

				    libtool \

				    patchelf \

				    ccache \

				    git \

				    wget \

				    python3-pip \

				    python3-venv \

				    python-is-python3 \

				    cmake \

				    sudo \

				    lsb-release \

				    gcc-${GCC_VERSION}-riscv64-linux-gnu \

				    g++-${GCC_VERSION}-riscv64-linux-gnu \

				    pkg-config \

				    && rm -rf /var/lib/apt/lists/*

				# Install user

				COPY ./common/install_user.sh install_user.sh

				RUN bash ./install_user.sh && rm install_user.sh

				FROM base as python

				ARG ZLIB_VERSION=1.3.1

				ARG FFI_VERSION=3.4.6

				ARG BZ2_VERSION=1.0.8

				ARG XZ_VERSION=5.4.6

				ARG OPENSSL_VERSION=3.2.1

				# Set up sysroot directory for dependencies

				ENV PKG_CONFIG_PATH=${SYSROOT}/lib/pkgconfig

				ENV PKG_CONFIG_SYSROOT_DIR=${SYSROOT}

				WORKDIR /opt

				# Build zlib (for compression)

				RUN echo "--- Building zlib ---" \

				    && wget -c https://www.zlib.net/zlib-${ZLIB_VERSION}.tar.gz \

				    && tar -xf zlib-${ZLIB_VERSION}.tar.gz --no-same-permissions --no-same-owner \

				    && cd zlib-${ZLIB_VERSION}/ \

				    && mkdir build && cd build \

				    && ../configure --prefix=${SYSROOT} \

				    && make -j$(nproc) && make install \

				    && cd ../..

				# Build libffi (for ctypes module)

				RUN echo "--- Building libffi ---" \

				    && wget -c https://github.com/libffi/libffi/releases/download/v${FFI_VERSION}/libffi-${FFI_VERSION}.tar.gz \

				    && tar -xf libffi-${FFI_VERSION}.tar.gz --no-same-permissions --no-same-owner \

				    && cd libffi-${FFI_VERSION}/ \

				    && mkdir build && cd build \

				    && ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \

				    && make -j$(nproc) && make install \

				    && cd ../..

				# Build bzip2 (for bz2 module)

				RUN echo "--- Building bzip2 ---" \

				    && wget -c https://sourceware.org/pub/bzip2/bzip2-${BZ2_VERSION}.tar.gz \

				    && tar -xf bzip2-${BZ2_VERSION}.tar.gz --no-same-permissions --no-same-owner \

				    && cd bzip2-${BZ2_VERSION}/ \

				    && make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} bzip2 bzip2recover libbz2.a \

				    && make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} -f Makefile-libbz2_so \

				    && make install PREFIX=${SYSROOT} \

				    && cp libbz2.so.${BZ2_VERSION} ${SYSROOT}/lib/ \

				    && cd ${SYSROOT}/lib/ \

				    && ln -sf libbz2.so.${BZ2_VERSION} libbz2.so.1.0 \

				    && ln -sf libbz2.so.1.0 libbz2.so \

				    && cd /opt/

				# Build xz (for lzma module)

				RUN echo "--- Building xz ---" \

				    && wget -c https://github.com/tukaani-project/xz/releases/download/v${XZ_VERSION}/xz-${XZ_VERSION}.tar.gz \

				    && tar -xf xz-${XZ_VERSION}.tar.gz --no-same-permissions --no-same-owner \

				    && cd xz-${XZ_VERSION} \

				    && mkdir build && cd build \

				    && ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \

				    && make -j$(nproc) && make install \

				    && cd ../..

				# Build OpenSSL (for ssl module)

				RUN echo "--- Building OpenSSL ---" \

				    && wget -c https://www.openssl.org/source/openssl-${OPENSSL_VERSION}.tar.gz \

				    && tar -xf openssl-${OPENSSL_VERSION}.tar.gz --no-same-permissions --no-same-owner \

				    && cd openssl-${OPENSSL_VERSION}/ \

				    && mkdir build && cd build \

				    && ../Configure linux64-riscv64 --prefix=${SYSROOT} \

				    && make -j$(nproc) && make install_sw \

				    && cd ../..

				# Build SQLite3 (for sqlite3 module)

				RUN echo "--- Building SQLite3 ---" \

				    && wget -c https://www.sqlite.org/2024/sqlite-autoconf-3450200.tar.gz \

				    && tar -xf sqlite-autoconf-3450200.tar.gz --no-same-permissions --no-same-owner \

				    && cd sqlite-autoconf-3450200 \

				    && mkdir build && cd build \

				    && ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \

				    && make -j$(nproc) && make install \

				    && cd ../..

				# Build and install RISC-V Python with all modules

				RUN wget -c https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \

				    && tar -xf Python-${PYTHON_VERSION}.tgz --no-same-permissions --no-same-owner \

				    && cd Python-${PYTHON_VERSION} \

				    && mkdir build && cd build \

				    && ../configure \

				        --host=riscv64-linux-gnu \

				        --build=x86_64-linux-gnu \

				        --prefix=${SYSROOT} \

				        --enable-shared \

				        --disable-ipv6 \

				        --with-build-python=/usr/bin/python3 \

				        --with-ensurepip=no \

				        ac_cv_file__dev_ptmx=yes \

				        ac_cv_file__dev_ptc=no \

				    && make -j$(nproc) \

				    && make install

				FROM base as final

				COPY --from=python             /opt/sysroot                       /opt/sysroot

				# Install crossenv and cmake

				RUN pip install crossenv cmake==4.0.0 --break-system-packages \

				    && /usr/bin/python3 -m crossenv ${SYSROOT}/bin/python3 /opt/riscv-cross-env

				# Add pip-installed cmake binaries to PATH

				ENV PATH="/usr/local/bin:${PATH}"

				# Set up cross Python environment

				SHELL ["/bin/bash", "-c"]

				RUN source /opt/riscv-cross-env/bin/activate \

				    && pip install setuptools pyyaml typing_extensions wheel

				# Set default environment variables for PyTorch build

				ENV Python_ROOT_DIR=${SYSROOT}

				ENV OPENSSL_ROOT_DIR=${SYSROOT}

				USER jenkins

				CMD ["bash"]

									
										5

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -96,10 +96,11 @@ ARG ANACONDA_PYTHON_VERSION

				ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION

				COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt

				COPY ci_commit_pins/timm.txt timm.txt

				COPY ci_commit_pins/torchbench.txt torchbench.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt

				# (optional) Install non-default Ninja version

				ARG NINJA_VERSION

									
										4

.ci/docker/ubuntu-xpu/Dockerfile
									
												
				@ -56,10 +56,10 @@ RUN rm install_openssl.sh

				ARG INDUCTOR_BENCHMARKS

				COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt

				COPY ci_commit_pins/timm.txt timm.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt

				# Install XPU Dependencies

				ARG XPU_VERSION

									
										7

.ci/docker/ubuntu/Dockerfile
									
												
				@ -66,6 +66,7 @@ ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"

				# (optional) Install UCC

				ARG UCX_COMMIT

				ARG UCC_COMMIT

				ARG CUDA_VERSION

				ENV UCX_COMMIT $UCX_COMMIT

				ENV UCC_COMMIT $UCC_COMMIT

				ENV UCX_HOME /usr

				@ -96,10 +97,11 @@ RUN rm install_openssl.sh

				ARG INDUCTOR_BENCHMARKS

				COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/huggingface.txt huggingface.txt

				COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt

				COPY ci_commit_pins/timm.txt timm.txt

				COPY ci_commit_pins/torchbench.txt torchbench.txt

				RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt

				RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt

				ARG TRITON

				ARG TRITON_CPU

				@ -180,7 +182,6 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi

				# AWS specific CUDA build guidance

				ENV TORCH_CUDA_ARCH_LIST Maxwell

				ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"

				ENV CUDA_PATH /usr/local/cuda

									
										2

.ci/libtorch/build.sh
									
												
				@ -7,4 +7,4 @@ set -ex

				SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

				USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh

				USE_NVSHMEM=0 USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.10" ${SCRIPTPATH}/../manywheel/build.sh

									
										31

.ci/lumen_cli/README.md
									
										Normal file
									
												
				@ -0,0 +1,31 @@

				# 🔧 Lumen_cli

				A Python CLI tool for building and testing PyTorch-based components, using a YAML configuration file for structured, repeatable workflows.

				## Features

				- **Build**

				    - external projects (e.g. vLLM)

				## 📦 Installation

				Run the following at the root of the PyTorch repo:

				```bash

				pip install -e .ci/lumen_cli

				```

				## Run the CLI tool

				The CLI tool must be run from the root of the PyTorch repo. For example, to build the external vLLM target:

				```bash

				python -m cli.run build external vllm

				```

				This runs the build steps with the default behaviour for the vLLM project.

				To see the help messages, run:

				```bash

				python3 -m cli.run --help

				```

				## Add customized external build logic

				To add a new external build target:

				1. Create the build function in the `cli/lib` folder.

				2. Register your target and its main build function at EXTERNAL_BUILD_TARGET_DISPATCH in `cli/build_cli/register_build.py`.

				3. [optional] Create your CI config file in .github/ci_configs/${EXTERNAL_PACKAGE_NAME}.yaml
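As a rough illustration of steps 1 and 2, the sketch below defines a runner and a registry entry for a made-up target. `MyProjectBuildRunner` and `"myproject"` are hypothetical, and `BaseRunner` here is a local stand-in written to resemble the class in `cli/lib/common/cli_helper.py`, so the snippet runs without the package installed. Note that `register_build.py` in this same change keeps its registry in a dict named `_TARGETS`, while the README refers to `EXTERNAL_BUILD_TARGET_DISPATCH`; the sketch uses a neutral name.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseRunner(ABC):
    """Local stand-in resembling cli.lib.common.cli_helper.BaseRunner."""

    def __init__(self, args: Any) -> None:
        self.args = args

    @abstractmethod
    def run(self) -> None:
        """Run the build."""


class MyProjectBuildRunner(BaseRunner):
    """Build the hypothetical 'myproject' target."""

    def run(self) -> None:
        # A real runner would clone, configure and build here.
        print("building myproject")


# Registry entry in the same shape as the "vllm" entry in register_build.py.
TARGETS = {
    "myproject": {
        "runner": MyProjectBuildRunner,
        "help": "Build myproject (hypothetical).",
    },
}

# What the CLI does, in essence, once a target has been selected.
runner_cls = TARGETS["myproject"]["runner"]
runner_cls(args=None).run()
```

With an entry like this registered, the target would be expected to become reachable as `python -m cli.run build external myproject`.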

0

test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_blocked → .ci/lumen_cli/cli/build_cli/__init__.py


									
										37

.ci/lumen_cli/cli/build_cli/register_build.py
									
										Normal file
									
												
				@ -0,0 +1,37 @@

				import argparse

				import logging

				from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec

				from cli.lib.core.vllm.vllm_build import VllmBuildRunner

				logger = logging.getLogger(__name__)

				# Maps targets to their argparse configuration and runner

				# Each entry adds a target invoked as `python -m cli.run build external {target}` and run by its build runner

				_TARGETS: dict[str, TargetSpec] = {

				    "vllm": {

				        "runner": VllmBuildRunner,

				        "help": "Build vLLM using docker buildx.",

				    }

				    # add yours ...

				}

				def register_build_commands(subparsers: argparse._SubParsersAction) -> None:

				    build_parser = subparsers.add_parser(

				        "build",

				        help="Build related commands",

				        formatter_class=RichHelp,

				    )

				    build_subparsers = build_parser.add_subparsers(dest="build_command", required=True)

				    overview = "\n".join(

				        f"  {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()

				    )

				    external_parser = build_subparsers.add_parser(

				        "external",

				        help="Build external targets",

				        description="Build third-party targets.\n\nAvailable targets:\n" + overview,

				        formatter_class=RichHelp,

				    )

				    register_targets(external_parser, _TARGETS)

0

test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_bool_called_at_least_once → .ci/lumen_cli/cli/lib/__init__.py


									
										71

.ci/lumen_cli/cli/lib/common/cli_helper.py
									
										Normal file
									
												
				@ -0,0 +1,71 @@

				"""

				Argparse utility helpers for CLI tasks.

				"""

				import argparse

				from abc import ABC, abstractmethod

				try:

				    from typing import Any, Callable, Required, TypedDict  # Python 3.11+

				except ImportError:

				    from typing import Any, Callable, TypedDict

				    from typing_extensions import Required  # Fallback for Python <3.11

				class BaseRunner(ABC):

				    def __init__(self, args: Any) -> None:

				        self.args = args

				    @abstractmethod

				    def run(self) -> None:

				        """Run the main logic. Subclasses must implement this."""

				# Pretty help: keep newlines + show defaults

				class RichHelp(

				    argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter

				):

				    pass

				class TargetSpec(TypedDict, total=False):

				    """CLI subcommand specification."""

				    runner: Required[type[BaseRunner]]

				    help: str

				    description: str

				    add_arguments: Callable[[argparse.ArgumentParser], None]

				def register_targets(

				    parser: argparse.ArgumentParser,

				    target_specs: dict[str, TargetSpec],

				    common_args: Callable[[argparse.ArgumentParser], None] = lambda _: None,

				) -> None:

				    """Register target subcommands."""

				    targets = parser.add_subparsers(

				        dest="target",

				        required=True,

				        metavar="{" + ",".join(target_specs.keys()) + "}",

				    )

				    for name, spec in target_specs.items():

				        desc = spec.get("description") or spec["runner"].__doc__ or ""

				        p = targets.add_parser(

				            name,

				            help=spec.get("help", ""),

				            description=desc.strip(),

				            formatter_class=RichHelp,

				        )

				        p.set_defaults(

				            func=lambda args, cls=spec["runner"]: cls(args).run(),

				            _runner_class=spec["runner"],

				        )

				        if "add_arguments" in spec and callable(spec["add_arguments"]):

				            spec["add_arguments"](p)

				        if common_args:

				            common_args(p)
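One detail in `register_targets` worth spelling out is the `cls=spec["runner"]` default argument on the lambda. Python closures look up loop variables when they are called, not when they are defined, so binding the runner as a default value pins each subcommand to its own class. The sketch below shows both behaviours with two hypothetical runners, `Alpha` and `Beta`; unlike the real runners, their `run` methods return a string so the dispatch result is visible.

```python
import argparse


class Alpha:
    def __init__(self, args):
        self.args = args

    def run(self):
        return "alpha ran"


class Beta:
    def __init__(self, args):
        self.args = args

    def run(self):
        return "beta ran"


RUNNERS = {"alpha": Alpha, "beta": Beta}

parser = argparse.ArgumentParser(prog="demo")
sub = parser.add_subparsers(dest="target", required=True)
for name, runner in RUNNERS.items():
    p = sub.add_parser(name)
    # The default argument captures the current class for this subcommand.
    p.set_defaults(func=lambda args, cls=runner: cls(args).run())

args = parser.parse_args(["alpha"])
print(args.func(args))

# Without the default argument, every lambda sees the final loop value.
late_bound = [lambda: runner_cls for runner_cls in RUNNERS.values()]
print([f().__name__ for f in late_bound])
```

The first `print` reports the `Alpha` runner, while the late-bound list resolves to `Beta` for every entry.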

									
										42

.ci/lumen_cli/cli/lib/common/docker_helper.py
									
										Normal file
									
												
				@ -0,0 +1,42 @@

				"""

				Docker Utility helpers for CLI tasks.

				"""

				import logging

				from typing import Optional

				import docker

				from docker.errors import APIError, NotFound

				logger = logging.getLogger(__name__)

				# lazy singleton so we don't reconnect every call

				_docker_client: Optional[docker.DockerClient] = None

				def _get_client() -> docker.DockerClient:

				    global _docker_client

				    if _docker_client is None:

				        _docker_client = docker.from_env()

				    return _docker_client

				def local_image_exists(

				    image_name: str, client: Optional[docker.DockerClient] = None

				) -> bool:

				    """Return True if a local Docker image exists."""

				    if not image_name:

				        return False

				    client = client or _get_client()

				    try:

				        client.images.get(image_name)

				        return True

				    except (NotFound, APIError) as e:

				        logger.error(

				            "Error when checking Docker image '%s': %s",

				            image_name,

				            e.explanation if hasattr(e, "explanation") else str(e),

				        )

				        return False
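Because `local_image_exists` accepts an optional `client`, it can be exercised without a Docker daemon by passing in a fake object. The sketch below is a simplified stand-in, not the module itself: `NotFound`, `FakeImages` and `FakeClient` are hypothetical test doubles, and the function body is reduced to the lookup logic so the snippet runs without the `docker` package.

```python
class NotFound(Exception):
    """Stand-in for docker.errors.NotFound."""


class FakeImages:
    """Mimics the small part of `client.images` that the helper uses."""

    def __init__(self, known):
        self._known = set(known)

    def get(self, name):
        if name not in self._known:
            raise NotFound(name)
        return object()


class FakeClient:
    def __init__(self, known):
        self.images = FakeImages(known)


def local_image_exists(image_name, client):
    """Simplified version of the helper: True only if the lookup succeeds."""
    if not image_name:
        return False
    try:
        client.images.get(image_name)
        return True
    except NotFound:
        return False


client = FakeClient(known=["demo/image:latest"])
print(local_image_exists("demo/image:latest", client))
print(local_image_exists("demo/missing:latest", client))
```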

									
										110

.ci/lumen_cli/cli/lib/common/envs_helper.py
									
										Normal file
									
												
				@ -0,0 +1,110 @@

				"""

				Environment Variables and Dataclasses Utility helpers for CLI tasks.

				"""

				import os

				from dataclasses import field, fields, is_dataclass, MISSING

				from pathlib import Path

				from textwrap import indent

				from typing import Optional, Union

				from cli.lib.common.utils import str2bool

				def get_env(name: str, default: str = "") -> str:

				    """Get environment variable with default fallback."""

				    return os.environ.get(name) or default

				def env_path_optional(

				    name: str,

				    default: Optional[Union[str, Path]] = None,

				    resolve: bool = True,

				) -> Optional[Path]:

				    """Get environment variable as optional Path."""

				    val = get_env(name) or default

				    if not val:

				        return None

				    path = Path(val)

				    return path.resolve() if resolve else path

				def env_path(

				    name: str,

				    default: Optional[Union[str, Path]] = None,

				    resolve: bool = True,

				) -> Path:

				    """Get environment variable as Path, raise if missing."""

				    path = env_path_optional(name, default, resolve)

				    if not path:

				        raise ValueError(f"Missing path value for {name}")

				    return path

				def env_bool(

				    name: str,

				    default: bool = False,

				) -> bool:

				    val = get_env(name)

				    if not val:

				        return default

				    return str2bool(val)

				def env_bool_field(

				    name: str,

				    default: bool = False,

				):

				    return field(default_factory=lambda: env_bool(name, default))

				def env_path_field(

				    name: str,

				    default: Union[str, Path] = "",

				    *,

				    resolve: bool = True,

				) -> Path:

				    return field(default_factory=lambda: env_path(name, default, resolve=resolve))

				def env_str_field(

				    name: str,

				    default: str = "",

				) -> str:

				    return field(default_factory=lambda: get_env(name, default))

				def generate_dataclass_help(cls) -> str:

				    """Auto-generate help text for dataclass fields."""

				    if not is_dataclass(cls):

				        raise TypeError(f"{cls} is not a dataclass")

				    def get_value(f):

				        if f.default is not MISSING:

				            return f.default

				        if f.default_factory is not MISSING:

				            try:

				                return f.default_factory()

				            except Exception as e:

				                return f"<error: {e}>"

				        return "<required>"

				    lines = [f"{f.name:<22} = {repr(get_value(f))}" for f in fields(cls)]

				    return indent("\n".join(lines), "    ")

				def with_params_help(params_cls: type, title: str = "Parameter defaults"):

				    """

				    Class decorator that appends a help table generated from another dataclass

				    (e.g., VllmParameters) to the decorated class's docstring.

				    """

				    if not is_dataclass(params_cls):

				        raise TypeError(f"{params_cls} must be a dataclass")

				    def _decorator(cls: type) -> type:

				        block = generate_dataclass_help(params_cls)

				        cls.__doc__ = (cls.__doc__ or "") + f"\n\n{title}:\n{block}"

				        return cls

				    return _decorator
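The `env_*_field` helpers wrap `dataclasses.field(default_factory=...)`, which means the environment is read each time the dataclass is instantiated rather than once at import time. The sketch below reproduces `get_env` and `env_str_field` locally and applies them to a hypothetical `DemoParams` class with a made-up variable name, `DEMO_IMAGE_TAG`. It also shows that an empty variable falls back to the default, because `get_env` uses `or`.

```python
import os
from dataclasses import dataclass, field


def get_env(name: str, default: str = "") -> str:
    """Same logic as the helper above: unset or empty values use the default."""
    return os.environ.get(name) or default


def env_str_field(name: str, default: str = ""):
    # The lambda runs at instantiation time, not at class definition time.
    return field(default_factory=lambda: get_env(name, default))


@dataclass
class DemoParams:
    """Hypothetical parameter class for illustration."""

    image_tag: str = env_str_field("DEMO_IMAGE_TAG", "latest")


os.environ.pop("DEMO_IMAGE_TAG", None)
first = DemoParams()  # variable unset

os.environ["DEMO_IMAGE_TAG"] = "nightly"
second = DemoParams()  # variable set after the class was defined

os.environ["DEMO_IMAGE_TAG"] = ""
third = DemoParams()  # empty string counts as missing

print(first.image_tag, second.image_tag, third.image_tag)
```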

									
										143

.ci/lumen_cli/cli/lib/common/gh_summary.py
									
										Normal file
									
												
				@ -0,0 +1,143 @@

				from __future__ import annotations

				import logging

				import os

				import textwrap

				from pathlib import Path

				from typing import TYPE_CHECKING

				from cli.lib.common.utils import get_wheels

				from jinja2 import Template

				if TYPE_CHECKING:

				    from collections.abc import Iterable, Mapping

				logger = logging.getLogger(__name__)

				_TPL_CONTENT = Template(

				    textwrap.dedent("""\

				    ## {{ title }}

				    ```{{ lang }}

				    {{ content }}

				    ```

				""")

				)

				_TPL_LIST_ITEMS = Template(

				    textwrap.dedent("""\

				    ## {{ title }}

				    {% for it in items %}

				    - {{ it.pkg }}: {{ it.relpath }}

				    {% else %}

				    _(no item found)_

				    {% endfor %}

				    """)

				)

				_TPL_TABLE = Template(

				    textwrap.dedent("""\

				    {%- if rows %}

				    | {{ cols | join(' | ') }} |

				    |{%- for _ in cols %} --- |{%- endfor %}

				    {%- for r in rows %}

				    | {%- for c in cols %} {{ r.get(c, "") }} |{%- endfor %}

				    {%- endfor %}

				    {%- else %}

				    _(no data)_

				    {%- endif %}

				""")

				)

				def gh_summary_path() -> Path | None:

				    """Return the Path to the GitHub step summary file, or None if not set."""

				    p = os.environ.get("GITHUB_STEP_SUMMARY")

				    return Path(p) if p else None

				def write_gh_step_summary(md: str, *, append_content: bool = True) -> bool:

				    """

				    Write Markdown content to the GitHub Step Summary file if GITHUB_STEP_SUMMARY is set.

				    append_content: if True (the default), append to the end of the file; otherwise overwrite the whole file.

				    Returns:

				        True if written successfully (in GitHub Actions environment),

				        False if skipped (e.g., running locally where the variable is not set).

				    """

				    sp = gh_summary_path()

				    if not sp:

				        logger.info("[gh-summary] GITHUB_STEP_SUMMARY not set, skipping write.")

				        return False

				    md_clean = textwrap.dedent(md).strip() + "\n"

				    mode = "a" if append_content else "w"

				    with sp.open(mode, encoding="utf-8") as f:

				        f.write(md_clean)

				    return True

				def md_heading(text: str, level: int = 2) -> str:

				    """Generate a Markdown heading string with the given level (1-6)."""

				    return f"{'#' * max(1, min(level, 6))} {text}\n"

				def md_details(summary: str, content: str) -> str:

				    """Generate a collapsible <details> block with a summary and inner content."""

				    return f"<details>\n<summary>{summary}</summary>\n\n{content}\n\n</details>\n"

				def summarize_content_from_file(

				    output_dir: Path,

				    freeze_file: str,

				    title: str = "Content from file",

				    code_lang: str = "",  # e.g. "text" or "ini"

				) -> bool:

				    f = Path(output_dir) / freeze_file

				    if not f.exists():

				        return False

				    content = f.read_text(encoding="utf-8").strip()

				    md = render_content(content, title=title, lang=code_lang)

				    return write_gh_step_summary(md)

				def summarize_wheels(path: Path, title: str = "Wheels", max_depth: int = 3):

				    items = get_wheels(path, max_depth=max_depth)

				    if not items:

				        return False

				    md = render_list(items, title=title)

				    return write_gh_step_summary(md)

				def md_kv_table(rows: Iterable[Mapping[str, str | int | float]]) -> str:

				    """

				    Render a list of dicts as a Markdown table using Jinja template.

				    """

				    rows = list(rows)

				    cols = list({k for r in rows for k in r.keys()})

				    md = _TPL_TABLE.render(cols=cols, rows=rows).strip() + "\n"

				    return md

				def render_list(

				    items: Iterable[str],

				    *,

				    title: str = "List",

				) -> str:

				    tpl = _TPL_LIST_ITEMS

				    md = tpl.render(title=title, items=items)

				    return md

				def render_content(

				    content: str,

				    *,

				    title: str = "Content",

				    lang: str = "text",

				) -> str:

				    tpl = _TPL_CONTENT

				    md = tpl.render(title=title, content=content, lang=lang)

				    return md
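`md_kv_table` collects its column names through a set comprehension, so the column order is not guaranteed to follow the order of keys in the rows. If a stable order matters, one possible alternative (a sketch, not the code in this change) is to deduplicate with `dict.fromkeys`, which keeps first-seen order. `ordered_columns` and `render_table` below are hypothetical names, and the table is built with plain string joins instead of the Jinja template to keep the snippet dependency-free.

```python
def ordered_columns(rows):
    """Unique keys across all rows, in first-seen order."""
    return list(dict.fromkeys(key for row in rows for key in row))


def render_table(rows):
    """Render a list of dicts as a Markdown table with stable column order."""
    rows = list(rows)
    if not rows:
        return "_(no data)_\n"
    cols = ordered_columns(rows)
    lines = [
        "| " + " | ".join(cols) + " |",
        "|" + " --- |" * len(cols),
    ]
    for row in rows:
        # Missing keys render as empty cells.
        cells = [str(row.get(col, "")) for col in cols]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines) + "\n"


rows = [{"name": "a", "size": 1}, {"name": "b", "time": 2.0}]
print(render_table(rows))
```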

									
										69

.ci/lumen_cli/cli/lib/common/git_helper.py
									
										Normal file
									
												
				@ -0,0 +1,69 @@

				"""

				Git Utility helpers for CLI tasks.

				"""

				import logging

				from pathlib import Path

				from cli.lib.common.path_helper import remove_dir

				from git import GitCommandError, RemoteProgress, Repo

				logger = logging.getLogger(__name__)

				class PrintProgress(RemoteProgress):

				    """Simple progress logger for git operations."""

				    def __init__(self, interval: int = 5):

				        super().__init__()

				        self._last_percent = -1

				        self._interval = interval

				    def update(self, op_code, cur, max=None, message=""):

				        msg = self._cur_line or message

				        if max and cur:

				            percent = int(cur / max * 100)

				            if percent != self._last_percent and percent % self._interval == 0:

				                self._last_percent = percent

				                logger.info("Progress: %d%% - %s", percent, msg)

				        elif msg:

				            logger.info(msg)

				def clone_external_repo(target: str, repo: str, dst: str = "", update_submodules=False):

				    """Clone repository with pinned commit and optional submodules."""

				    dst = dst or target

				    try:

				        logger.info("Cloning %s to %s", target, dst)

				        # Clone and fetch

				        remove_dir(dst)

				        r = Repo.clone_from(repo, dst, progress=PrintProgress())

				        r.git.fetch("--all", "--tags")

				        # Checkout pinned commit

				        commit = get_post_build_pinned_commit(target)

				        logger.info("Checking out pinned %s commit %s", target, commit)

				        r.git.checkout(commit)

				        # Update submodules if requested

				        if update_submodules and r.submodules:

				            logger.info("Updating %d submodule(s)", len(r.submodules))

				            for sm in r.submodules:

				                sm.update(init=True, recursive=True, progress=PrintProgress())

				        logger.info("Successfully cloned %s", target)

				        return r, commit

				    except GitCommandError as e:

				        logger.error("Git operation failed: %s", e)

				        raise

				def get_post_build_pinned_commit(name: str, prefix=".github/ci_commit_pins") -> str:

				    path = Path(prefix) / f"{name}.txt"

				    if not path.exists():

				        raise FileNotFoundError(f"Pin file not found: {path}")

				    return path.read_text(encoding="utf-8").strip()

									
										14

.ci/lumen_cli/cli/lib/common/logger.py
									
												
				@ -0,0 +1,14 @@

				"""

				Logger Utility helpers for CLI tasks.

				"""

				import logging

				import sys

				def setup_logging(level: int = logging.INFO):

				    logging.basicConfig(

				        level=level,

				        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",

				        stream=sys.stdout,

				    )

									
										62

.ci/lumen_cli/cli/lib/common/path_helper.py
									
												
				@ -0,0 +1,62 @@

				"""Path utility helpers for CLI tasks."""

				import logging

				import shutil

				from pathlib import Path

				from typing import Union

				logger = logging.getLogger(__name__)

				def get_path(path: Union[str, Path], resolve: bool = False) -> Path:

				    """Convert to Path object, optionally resolving to absolute path."""

				    if not path:

				        raise ValueError("Path cannot be None or empty")

				    result = Path(path)

				    return result.resolve() if resolve else result

				def ensure_dir_exists(path: Union[str, Path]) -> Path:

				    """Create directory if it doesn't exist."""

				    path_obj = get_path(path)

				    path_obj.mkdir(parents=True, exist_ok=True)

				    return path_obj

				def remove_dir(path: Union[str, Path, None]) -> None:

				    """Remove directory if it exists."""

				    if not path:

				        return

				    path_obj = get_path(path)

				    if path_obj.exists():

				        shutil.rmtree(path_obj)

				def force_create_dir(path: Union[str, Path]) -> Path:

				    """Remove directory if exists, then create fresh empty directory."""

				    remove_dir(path)

				    return ensure_dir_exists(path)

				def copy(src: Union[str, Path], dst: Union[str, Path]) -> None:

				    """Copy file or directory from src to dst."""

				    src_path = get_path(src, resolve=True)

				    dst_path = get_path(dst, resolve=True)

				    if not src_path.exists():

				        raise FileNotFoundError(f"Source does not exist: {src_path}")

				    dst_path.parent.mkdir(parents=True, exist_ok=True)

				    if src_path.is_file():

				        shutil.copy2(src_path, dst_path)

				    elif src_path.is_dir():

				        shutil.copytree(src_path, dst_path, dirs_exist_ok=True)

				    else:

				        raise ValueError(f"Unsupported path type: {src_path}")

				def is_path_exist(path: Union[str, Path, None]) -> bool:

				    """Check if path exists."""

				    return bool(path and get_path(path).exists())

									
										71

.ci/lumen_cli/cli/lib/common/pip_helper.py
									
												
				@ -0,0 +1,71 @@

				import glob

				import logging

				import shlex

				import shutil

				import sys

				from collections.abc import Iterable

				from importlib.metadata import PackageNotFoundError, version  # noqa: UP035

				from typing import Optional, Union

				from cli.lib.common.utils import run_command

				logger = logging.getLogger(__name__)

				def pip_install_packages(

				    packages: Iterable[str] = (),

				    env=None,

				    *,

				    requirements: Optional[str] = None,

				    constraints: Optional[str] = None,

				    prefer_uv: bool = False,

				) -> None:

				    use_uv = prefer_uv and shutil.which("uv") is not None

				    base = (

				        [sys.executable, "-m", "uv", "pip", "install"]

				        if use_uv

				        else [sys.executable, "-m", "pip", "install"]

				    )

				    cmd = base[:]

				    if requirements:

				        cmd += ["-r", requirements]

				    if constraints:

				        cmd += ["-c", constraints]

				    cmd += list(packages)

				    logger.info("pip installing packages: %s", " ".join(map(shlex.quote, cmd)))

				    run_command(" ".join(map(shlex.quote, cmd)), env=env)

				def pip_install_first_match(pattern: str, extras: Optional[str] = None, pref_uv=False):

				    wheel = first_matching_pkg(pattern)

				    target = f"{wheel}[{extras}]" if extras else wheel

				    logger.info("Installing %s...", target)

				    pip_install_packages([target], prefer_uv=pref_uv)

				def run_python(args: Union[str, list[str]], env=None):

				    """

				    Run the python in the current environment.

				    """

				    if isinstance(args, str):

				        args = shlex.split(args)

				    cmd = [sys.executable] + args

				    run_command(" ".join(map(shlex.quote, cmd)), env=env)

				def pkg_exists(name: str) -> bool:

				    try:

				        pkg_version = version(name)

				        logger.info("%s already exists with version: %s", name, pkg_version)

				        return True

				    except PackageNotFoundError:

				        logger.info("%s is not installed", name)

				        return False

				def first_matching_pkg(pattern: str) -> str:

				    matches = sorted(glob.glob(pattern))

				    if not matches:

				        raise FileNotFoundError(f"No wheel matching: {pattern}")

				    return matches[0]
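pip_install_packages and run_python join their argument lists with shlex.quote, and run_command splits the string again with shlex.split. The sketch below, using a made-up argument list, shows why that round trip keeps arguments with spaces or shell metacharacters intact.

```python
import shlex

# Hypothetical argument list; the file name with a space is for illustration only.
cmd = ["python", "-m", "pip", "install", "-c", "my constraints.txt", "pkg[extra]>=1.0"]

# Quote each argument before joining, as the helpers above do.
joined = " ".join(map(shlex.quote, cmd))

# Splitting the quoted string recovers the original argument list.
round_trip = shlex.split(joined)
print(joined)
print(round_trip == cmd)
```

Joining with a bare `" ".join(cmd)` would instead split `my constraints.txt` into two arguments on the way back.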

									
										139

.ci/lumen_cli/cli/lib/common/utils.py
									
												
				@ -0,0 +1,139 @@

				"""

				General Utility helpers for CLI tasks.

				"""

				import logging

				import os

				import shlex

				import subprocess

				import sys

				from contextlib import contextmanager

				from pathlib import Path

				from typing import Optional

				logger = logging.getLogger(__name__)

				def run_command(

				    cmd: str,

				    use_shell: bool = False,

				    log_cmd: bool = True,

				    cwd: Optional[str] = None,

				    env: Optional[dict] = None,

				    check: bool = True,

				) -> int:

				    """Run a command with optional shell execution."""

				    if use_shell:

				        args = cmd

				        log_prefix = "[shell]"

				        executable = "/bin/bash"

				    else:

				        args = shlex.split(cmd)

				        log_prefix = "[cmd]"

				        executable = None

				    if log_cmd:

				        display_cmd = cmd if use_shell else " ".join(args)

				        logger.info("%s %s", log_prefix, display_cmd)

				    run_env = {**os.environ, **(env or {})}

				    proc = subprocess.run(

				        args,

				        shell=use_shell,

				        executable=executable,

				        stdout=sys.stdout,

				        stderr=sys.stderr,

				        cwd=cwd,

				        env=run_env,

				        check=False,

				    )

				    if check and proc.returncode != 0:

				        logger.error(

				            "%s Command failed (exit %s): %s", log_prefix, proc.returncode, cmd

				        )

				        raise subprocess.CalledProcessError(

				            proc.returncode, args if not use_shell else cmd

				        )

				    return proc.returncode

				def str2bool(value: Optional[str]) -> bool:

				    """Convert environment variables to boolean values."""

				    if not value:

				        return False

				    if not isinstance(value, str):

				        raise ValueError(

				            f"Expected a string value for boolean conversion, got {type(value)}"

				        )

				    value = value.strip().lower()

				    true_value_set = {"1", "true", "t", "yes", "y", "on", "enable", "enabled", "found"}

				    false_value_set = {"0", "false", "f", "no", "n", "off", "disable"}

				    if value in true_value_set:

				        return True

				    if value in false_value_set:

				        return False

				    raise ValueError(f"Invalid string value for boolean conversion: {value}")

				@contextmanager

				def temp_environ(updates: dict[str, str]):

				    """

				    Temporarily set environment variables and restore them after the block.

				    Args:

				        updates: Dict of environment variables to set.

				    """

				    missing = object()

				    old: dict[str, str | object] = {k: os.environ.get(k, missing) for k in updates}

				    try:

				        os.environ.update(updates)

				        yield

				    finally:

				        for k, v in old.items():

				            if v is missing:

				                os.environ.pop(k, None)

				            else:

				                os.environ[k] = v  # type: ignore[arg-type]

				@contextmanager

				def working_directory(path: str):

				    """

				    Temporarily change the working directory inside a context.

				    """

				    if not path:

				        # No-op context

				        yield

				        return

				    prev_cwd = os.getcwd()

				    try:

				        os.chdir(path)

				        yield

				    finally:

				        os.chdir(prev_cwd)

				def get_wheels(

				    output_dir: Path,

				    max_depth: Optional[int] = None,

				) -> list[dict[str, str]]:

				    """Return a list of wheels found in the given output directory."""

				    root = Path(output_dir)

				    if not root.exists():

				        return []

				    items = []

				    for dirpath, _, filenames in os.walk(root):

				        depth = Path(dirpath).relative_to(root).parts

				        if max_depth is not None and len(depth) > max_depth:

				            continue

				        for fname in sorted(filenames):

				            if fname.endswith(".whl"):

				                pkg = fname.split("-")[0]

				                relpath = str((Path(dirpath) / fname).relative_to(root))

				                items.append({"pkg": pkg, "relpath": relpath})

				    return items
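A short usage sketch of the environment-restoring pattern behind temp_environ. The context manager is re-declared here only so the snippet runs on its own, and the variable names `DEMO_FLAG` and `DEMO_MODE` are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(updates: dict[str, str]):
    """Stand-alone copy of the pattern, re-declared so this snippet is runnable."""
    missing = object()  # sentinel: distinguishes "unset" from an empty string
    old = {k: os.environ.get(k, missing) for k in updates}
    try:
        os.environ.update(updates)
        yield
    finally:
        for k, v in old.items():
            if v is missing:
                os.environ.pop(k, None)  # variable did not exist before: remove it
            else:
                os.environ[k] = v  # variable existed before: restore its old value


os.environ.pop("DEMO_FLAG", None)  # hypothetical variable, unset beforehand
os.environ["DEMO_MODE"] = "outer"  # hypothetical variable, set beforehand

with temp_environ({"DEMO_FLAG": "1", "DEMO_MODE": "inner"}):
    inside = (os.environ["DEMO_FLAG"], os.environ["DEMO_MODE"])

after = (os.environ.get("DEMO_FLAG"), os.environ["DEMO_MODE"])
print(inside, after)
```

The sentinel object lets the restore step tell a variable that was never set apart from one that was set to an empty value.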

									
										292

.ci/lumen_cli/cli/lib/core/vllm/lib.py
									
												
				@ -0,0 +1,292 @@

				import logging

				import os

				import textwrap

				from typing import Any

				from cli.lib.common.gh_summary import write_gh_step_summary

				from cli.lib.common.git_helper import clone_external_repo

				from cli.lib.common.pip_helper import pip_install_packages

				from cli.lib.common.utils import run_command, temp_environ, working_directory

				from jinja2 import Template

				logger = logging.getLogger(__name__)

				_TPL_VLLM_INFO = Template(

				    textwrap.dedent("""\

				    ##  Vllm against Pytorch CI Test Summary

				    **Vllm Commit**: [{{ vllm_commit }}](https://github.com/vllm-project/vllm/commit/{{ vllm_commit }})

				    {%- if torch_sha %}

				    **Pytorch Commit**: [{{ torch_sha }}](https://github.com/pytorch/pytorch/commit/{{ torch_sha }})

				    {%- endif %}

				""")

				)

				def sample_vllm_test_library():

				    """

				    Simple sample test library to unblock vLLM CI development. It mimics

				    https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml

				    See run_test_plan for more details.

				    """

				    # TODO(elainewy): Read from yaml file to handle the env and tests for vllm

				    return {

				        "vllm_basic_correctness_test": {

				            "title": "Basic Correctness Test",

				            "id": "vllm_basic_correctness_test",

				            "env_vars": {

				                "VLLM_WORKER_MULTIPROC_METHOD": "spawn",

				            },

				            "steps": [

				                "pytest -v -s basic_correctness/test_cumem.py",

				                "pytest -v -s basic_correctness/test_basic_correctness.py",

				                "pytest -v -s basic_correctness/test_cpu_offload.py",

				            ],

				        },

				        "vllm_basic_models_test": {

				            "title": "Basic models test",

				            "id": "vllm_basic_models_test",

				            "steps": [

				                "pytest -v -s models/test_transformers.py",

				                "pytest -v -s models/test_registry.py",

				                "pytest -v -s models/test_utils.py",

				                "pytest -v -s models/test_vision.py",

				                "pytest -v -s models/test_initialization.py",

				            ],

				        },

				        "vllm_entrypoints_test": {

				            "title": "Entrypoints Test",

				            "id": "vllm_entrypoints_test",

				            "env_vars": {

				                "VLLM_WORKER_MULTIPROC_METHOD": "spawn",

				            },

				            "steps": [

				                " ".join(

				                    [

				                        "pytest",

				                        "-v",

				                        "-s",

				                        "entrypoints/llm",

				                        "--ignore=entrypoints/llm/test_generate.py",

				                        "--ignore=entrypoints/llm/test_collective_rpc.py",

				                    ]

				                ),

				                "pytest -v -s entrypoints/llm/test_generate.py",

				                "pytest -v -s entrypoints/offline_mode",

				            ],

				        },

				        "vllm_regression_test": {

				            "title": "Regression Test",

				            "id": "vllm_regression_test",

				            "package_install": ["modelscope"],

				            "steps": [

				                "pytest -v -s test_regression.py",

				            ],

				        },

				        "vllm_lora_tp_test_distributed": {

				            "title": "LoRA TP Test (Distributed)",

				            "id": "vllm_lora_tp_test_distributed",

				            "env_vars": {

				                "VLLM_WORKER_MULTIPROC_METHOD": "spawn",

				            },

				            "num_gpus": 4,

				            "steps": [

				                "pytest -v -s -x lora/test_chatglm3_tp.py",

				                "pytest -v -s -x lora/test_llama_tp.py",

				                "pytest -v -s -x lora/test_llm_with_multi_loras.py",

				            ],

				        },

				        "vllm_distributed_test_28_failure_test": {

				            "title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",

				            "id": "vllm_distributed_test_28_failure_test",

				            "env_vars": {

				                "VLLM_WORKER_MULTIPROC_METHOD": "spawn",

				            },

				            "num_gpus": 4,

				            "steps": [

				                "pytest -v -s distributed/test_sequence_parallel.py",

				            ],

				        },

				        "vllm_lora_28_failure_test": {

				            "title": "LoRA pytorch 2.8 failure test",

				            "id": "vllm_lora_28_failure_test",

				            "steps": ["pytest -v lora/test_quant_model.py"],

				        },

				        "vllm_multi_model_processor_test": {

				            "title": "Multi-Modal Processor Test",

				            "id": "vllm_multi_model_processor_test",

				            "package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],

				            "steps": [

				                "pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py",

				            ],

				        },

				        "vllm_multi_model_test_28_failure_test": {

				            "title": "Multi-Modal Test (Failed 2.8 release)",

				            "id": "vllm_multi_model_test_28_failure_test",

				            "package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],

				            "steps": [

				                "pytest -v -s models/multimodal/generation/test_voxtral.py",

				                "pytest -v -s models/multimodal/pooling",

				            ],

				        },

				        "vllm_pytorch_compilation_unit_tests": {

				            "title": "PyTorch Compilation Unit Tests",

				            "id": "vllm_pytorch_compilation_unit_tests",

				            "steps": [

				                "pytest -v -s compile/test_pass_manager.py",

				                "pytest -v -s compile/test_fusion.py",

				                "pytest -v -s compile/test_fusion_attn.py",

				                "pytest -v -s compile/test_silu_mul_quant_fusion.py",

				                "pytest -v -s compile/test_sequence_parallelism.py",

				                "pytest -v -s compile/test_async_tp.py",

				                "pytest -v -s compile/test_fusion_all_reduce.py",

				                "pytest -v -s compile/test_decorator.py",

				            ],

				        },

				        "vllm_languagde_model_test_extended_generation_28_failure_test": {

				            "title": "Language Models Test (Extended Generation) 2.8 release failure",

				            "id": "vllm_languagde_model_test_extended_generation_28_failure_test",

				            "package_install": [

				                "--no-build-isolation",

				                "git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8",

				            ],

				            "steps": [

				                "pytest -v -s models/language/generation/test_mistral.py",

				            ],

				        },

				        "vllm_distributed_test_2_gpu_28_failure_test": {

				            "title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",

				            "id": "vllm_distributed_test_2_gpu_28_failure_test",

				            "env_vars": {

				                "VLLM_WORKER_MULTIPROC_METHOD": "spawn",

				            },

				            "num_gpus": 4,

				            "steps": [

				                "pytest -v -s distributed/test_sequence_parallel.py",

				            ],

				        },

				        # TODO(elainewy): need to add g6 with 4 GPUs to run this test

				        "vllm_lora_test": {

				            "title": "LoRA Test %N",

				            "id": "lora_test",

				            "parallelism": 4,

				            "steps": [

				                "echo '[checking] list sharded lora tests:'",

				                " ".join(

				                    [

				                        "pytest -q --collect-only lora",

				                        "--shard-id=$$BUILDKITE_PARALLEL_JOB",

				                        "--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",

				                        "--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",

				                    ]

				                ),

				                "echo '[checking] Done. list lora tests'",

				                " ".join(

				                    [

				                        "pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB",

				                        "--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",

				                        "--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",

				                    ]

				                ),

				            ],

				        },

				    }

				def check_parallelism(tests: Any, title: str, shard_id: int = 0, num_shards: int = 0):

				    """

				    Check whether the test plan is sharded (parallel) and validate the shard arguments.

				    """

				    parallelism = int(tests.get("parallelism", "0"))

				    is_parallel = parallelism and parallelism > 1

				    if not is_parallel:

				        return False

				    if shard_id < 0 or shard_id >= num_shards:

				        raise RuntimeError(

				            f"Test {title} expects a shard id in [0, {num_shards}), but invalid {shard_id} is provided"

				        )

				    if num_shards != parallelism:

				        raise RuntimeError(

				            f"Test {title} expects {parallelism} shards, but invalid {num_shards} is provided"

				        )

				    return True

				def run_test_plan(

				    test_plan: str,

				    test_target: str,

				    tests_map: dict[str, Any],

				    shard_id: int = 0,

				    num_shards: int = 0,

				):

				    """

				    Run the list of test steps defined by the given test plan.

				    """

				    logger.info("run %s tests.....", test_target)

				    if test_plan not in tests_map:

				        raise RuntimeError(

				            f"test {test_plan} not found, please add it to test plan pool"

				        )

				    tests = tests_map[test_plan]

				    pkgs = tests.get("package_install", [])

				    title = tests.get("title", "unknown test")

				    is_parallel = check_parallelism(tests, title, shard_id, num_shards)

				    if is_parallel:

				        title = title.replace("%N", f"{shard_id}/{num_shards}")

				    logger.info("Running tests: %s", title)

				    if pkgs:

				        logger.info("Installing packages: %s", pkgs)

				        pip_install_packages(packages=pkgs, prefer_uv=True)

				    with (

				        working_directory(tests.get("working_directory", "tests")),

				        temp_environ(tests.get("env_vars", {})),

				    ):

				        failures = []

				        for step in tests["steps"]:

				            logger.info("Running step: %s", step)

				            if is_parallel:

				                step = replace_buildkite_placeholders(step, shard_id, num_shards)

				                logger.info("Running parallel step: %s", step)

				            code = run_command(cmd=step, check=False, use_shell=True)

				            if code != 0:

				                failures.append(step)

				            logger.info("Finish running step: %s", step)

				        if failures:

				            logger.error("Failed tests: %s", failures)

				            raise RuntimeError(f"{len(failures)} pytest runs failed: {failures}")

				        logger.info("Done. All tests passed")

				def clone_vllm(dst: str = "vllm"):

				    _, commit = clone_external_repo(

				        target="vllm",

				        repo="https://github.com/vllm-project/vllm.git",

				        dst=dst,

				        update_submodules=True,

				    )

				    return commit

				def replace_buildkite_placeholders(step: str, shard_id: int, num_shards: int) -> str:

				    mapping = {

				        "$$BUILDKITE_PARALLEL_JOB_COUNT": str(num_shards),

				        "$$BUILDKITE_PARALLEL_JOB": str(shard_id),

				    }

				    for k in sorted(mapping, key=len, reverse=True):

				        step = step.replace(k, mapping[k])

				    return step

				def summarize_build_info(vllm_commit: str) -> bool:

				    torch_sha = os.getenv("GITHUB_SHA")

				    md = (

				        _TPL_VLLM_INFO.render(vllm_commit=vllm_commit, torch_sha=torch_sha).strip()

				        + "\n"

				    )

				    return write_gh_step_summary(md)
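replace_buildkite_placeholders substitutes the longest placeholder first. The sketch below illustrates why that order matters when one placeholder is a prefix of another. `replace_placeholders` and its `longest_first` switch are hypothetical and exist only to contrast the two orders.

```python
def replace_placeholders(step: str, mapping: dict[str, str], longest_first: bool) -> str:
    """Substitute each key in `mapping`; the flag only exists to compare orderings."""
    keys = sorted(mapping, key=len, reverse=longest_first)
    for key in keys:
        step = step.replace(key, mapping[key])
    return step


mapping = {
    "$$BUILDKITE_PARALLEL_JOB": "1",
    "$$BUILDKITE_PARALLEL_JOB_COUNT": "4",
}
step = "pytest lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT"

good = replace_placeholders(step, mapping, longest_first=True)
bad = replace_placeholders(step, mapping, longest_first=False)
print(good)  # pytest lora --shard-id=1 --num-shards=4
print(bad)   # pytest lora --shard-id=1 --num-shards=1_COUNT
```

With the shorter key applied first, the prefix of `$$BUILDKITE_PARALLEL_JOB_COUNT` is consumed and a stray `_COUNT` suffix is left behind.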

									
										285

.ci/lumen_cli/cli/lib/core/vllm/vllm_build.py
									
												
				@ -0,0 +1,285 @@

				import logging

				import os

				import textwrap

				from dataclasses import dataclass

				from pathlib import Path

				from typing import Optional

				from cli.lib.common.cli_helper import BaseRunner

				from cli.lib.common.docker_helper import local_image_exists

				from cli.lib.common.envs_helper import (

				    env_bool_field,

				    env_path_field,

				    env_str_field,

				    with_params_help,

				)

				from cli.lib.common.gh_summary import (

				    gh_summary_path,

				    summarize_content_from_file,

				    summarize_wheels,

				)

				from cli.lib.common.path_helper import (

				    copy,

				    ensure_dir_exists,

				    force_create_dir,

				    get_path,

				    is_path_exist,

				)

				from cli.lib.common.utils import run_command

				from cli.lib.core.vllm.lib import clone_vllm, summarize_build_info

				logger = logging.getLogger(__name__)

				# Default path for docker build artifacts

				_DEFAULT_RESULT_PATH = "./shared"

				# Temp folder inside the vllm work directory; torch wheels are copied here for the docker build

				_VLLM_TEMP_FOLDER = "tmp"

				@dataclass

				class VllmBuildParameters:

				    """

				    Parameters defining the vllm external input configurations.

				    Combine with VllmDockerBuildArgs to define the vllm build environment

				    """

				    # USE_TORCH_WHEEL: when true, use local torch wheels; requires TORCH_WHEELS_PATH.

				    # Otherwise, the docker build pulls torch nightly during the build.

				    # TORCH_WHEELS_PATH: directory containing local torch wheels when use_torch_whl is True

				    use_torch_whl: bool = env_bool_field("USE_TORCH_WHEEL", True)

				    torch_whls_path: Path = env_path_field("TORCH_WHEELS_PATH", "./dist")

				    # USE_LOCAL_BASE_IMAGE: when true, use an existing local Docker base image; requires BASE_IMAGE

				    # Otherwise, pull dockerfile's default image remotely

				    # BASE_IMAGE: name:tag (only needed when use_local_base_image is True)

				    use_local_base_image: bool = env_bool_field("USE_LOCAL_BASE_IMAGE", True)

				    base_image: str = env_str_field("BASE_IMAGE")

				    # USE_LOCAL_DOCKERFILE: when true ("1"), use a local Dockerfile; requires DOCKERFILE_PATH.

				    # Otherwise, use vllm's default docker/Dockerfile.nightly_torch for the build.

				    # DOCKERFILE_PATH: path to the Dockerfile used when use_local_dockerfile is True

				    use_local_dockerfile: bool = env_bool_field("USE_LOCAL_DOCKERFILE", True)

				    dockerfile_path: Path = env_path_field(

				        "DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile.tmp_vllm"

				    )

				    # OUTPUT_DIR: where docker buildx (local exporter) will write artifacts

				    output_dir: Path = env_path_field("OUTPUT_DIR", "external/vllm")

				    # --- Build args ----------------------------------------------------------

				    target_stage: str = env_str_field("TARGET_STAGE", "export-wheels")

				    tag_name: str = env_str_field("TAG", "vllm-wheels")

				    cuda_version: str = env_str_field("CUDA_VERSION", "12.8.1")

				    python_version: str = env_str_field("PYTHON_VERSION", "3.12")

				    max_jobs: str = env_str_field("MAX_JOBS", "64")

				    sccache_bucket: str = env_str_field("SCCACHE_BUCKET")

				    sccache_region: str = env_str_field("SCCACHE_REGION")

				    torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")

				    def __post_init__(self):

				        checks = [

				            (

				                self.use_torch_whl,  # flag

				                True,  # trigger_value

				                "torch_whls_path",  # resource

				                is_path_exist,  # check_func

				                "TORCH_WHEELS_PATH is not provided, but USE_TORCH_WHEEL is set to 1",

				            ),

				            (

				                self.use_local_base_image,

				                True,

				                "base_image",

				                local_image_exists,

				                f"BASE_IMAGE {self.base_image} is not found, but USE_LOCAL_BASE_IMAGE is set to 1",

				            ),

				            (

				                self.use_local_dockerfile,

				                True,

				                "dockerfile_path",

				                is_path_exist,

				                "DOCKERFILE_PATH is not found, but USE_LOCAL_DOCKERFILE is set to 1",

				            ),

				        ]

				        for flag, trigger_value, attr_name, check_func, error_msg in checks:

				            value = getattr(self, attr_name)

				            if flag == trigger_value:

				                if not value or not check_func(value):

				                    raise ValueError(error_msg)

				            else:

				                logger.info("flag for %s is not set; skipping check", attr_name)

				        if not self.output_dir:

				            raise ValueError("missing required output_dir")

				@with_params_help(VllmBuildParameters)

				class VllmBuildRunner(BaseRunner):

				    """

				    Build vLLM using docker buildx.

				    Environment variable options:

				        "USE_TORCH_WHEEL":      "1: use local wheels; 0: pull nightly from pypi",

				        "TORCH_WHEELS_PATH":    "Path to local wheels (when USE_TORCH_WHEEL=1)",

				        "USE_LOCAL_BASE_IMAGE": "1: use local base image; 0: default image",

				        "BASE_IMAGE":           "name:tag to indicate base image the dockerfile depends on (when USE_LOCAL_BASE_IMAGE=1)",

				        "USE_LOCAL_DOCKERFILE": "1: use local Dockerfile; 0: vllm repo default docker/Dockerfile.nightly_torch",

				        "DOCKERFILE_PATH":      "Path to Dockerfile (when USE_LOCAL_DOCKERFILE=1)",

				        "OUTPUT_DIR":           "e.g. './shared'",

				        "TORCH_CUDA_ARCH_LIST": "e.g. '8.0' or '8.0;9.0'",

				        "CUDA_VERSION":         "e.g. '12.8.1'",

				        "PYTHON_VERSION":       "e.g. '3.12'",

				        "MAX_JOBS":             "e.g. '64'",

				        "SCCACHE_BUCKET":       "e.g. 'my-bucket'",

				        "SCCACHE_REGION":       "e.g. 'us-west-2'",

				    """

				    def __init__(self, args=None):

				        self.work_directory = "vllm"

				    def run(self):

				        """

				        main function to run vllm build

				        1. prepare vllm build environment

				        2. prepare the docker build command args

				        3. run docker build

				        """

				        inputs = VllmBuildParameters()

				        logger.info("Running vllm build with inputs: %s", inputs)

				        vllm_commit = clone_vllm()

				        self.cp_dockerfile_if_exist(inputs)

				        # copy torch wheels from the root directory to the vllm workspace if they exist

				        self.cp_torch_whls_if_exist(inputs)

				        # make sure the output dir that stores the build artifacts exists

				        ensure_dir_exists(Path(inputs.output_dir))

				        cmd = self._generate_docker_build_cmd(inputs)

				        logger.info("Running docker build: \n %s", cmd)

				        try:

				            run_command(cmd, cwd="vllm", env=os.environ.copy())

				        finally:

				            self.generate_vllm_build_summary(vllm_commit, inputs)

				    def generate_vllm_build_summary(

				        self, vllm_commit: str, inputs: VllmBuildParameters

				    ):

				        if not gh_summary_path():

				            return logger.info("Skipping: GH step summary env var not detected")

				        logger.info("Generate GH Summary ...")

				        # summarize vllm build info

				        summarize_build_info(vllm_commit)

				        # summarize vllm build artifacts

				        vllm_artifact_dir = inputs.output_dir / "wheels"

				        summarize_content_from_file(

				            vllm_artifact_dir,

				            "build_summary.txt",

				            title="Vllm build env pip package summary",

				        )

				        summarize_wheels(

				            inputs.torch_whls_path, max_depth=3, title="Torch Wheels Artifacts"

				        )

				        summarize_wheels(vllm_artifact_dir, max_depth=3, title="Vllm Wheels Artifacts")

				    def cp_torch_whls_if_exist(self, inputs: VllmBuildParameters) -> str:

				        if not inputs.use_torch_whl:

				            return ""

				        tmp_dir = f"./{self.work_directory}/{_VLLM_TEMP_FOLDER}"

				        tmp_path = Path(tmp_dir)

				        force_create_dir(tmp_path)

				        copy(inputs.torch_whls_path, tmp_dir)

				        return tmp_dir

				    def cp_dockerfile_if_exist(self, inputs: VllmBuildParameters):

				        if not inputs.use_local_dockerfile:

				            logger.info("using vllm default docker/Dockerfile.nightly_torch for build")

				            return

				        dockerfile_path = get_path(inputs.dockerfile_path, resolve=True)

				        vllm_torch_dockerfile = Path(

				            f"./{self.work_directory}/docker/Dockerfile.nightly_torch"

				        )

				        copy(dockerfile_path, vllm_torch_dockerfile)

				    def get_result_path(self, path):

				        """

				        Get the absolute path of the result path

				        """

				        if not path:

				            path = _DEFAULT_RESULT_PATH

				        abs_path = get_path(path, resolve=True)

				        return abs_path

				    def _get_torch_wheel_path_arg(self, torch_whl_dir: Optional[Path]) -> str:

				        if not torch_whl_dir:

				            return ""

				        return f"--build-arg TORCH_WHEELS_PATH={_VLLM_TEMP_FOLDER}"

				    def _get_base_image_args(self, inputs: VllmBuildParameters) -> tuple[str, str, str]:

				        """

				        Returns:

				            - base_image_arg: docker buildx arg string for base image

				            - final_base_image_arg:  docker buildx arg string for vllm-base stage

				            - pull_flag: "--pull=false" when the base image exists locally, otherwise an empty string

				        """

				        if not inputs.use_local_base_image:

				            return "", "", ""

				        base_image = inputs.base_image

				        # set both base image and final base image to the same local image

				        base_image_arg = f"--build-arg BUILD_BASE_IMAGE={base_image}"

				        final_base_image_arg = f"--build-arg FINAL_BASE_IMAGE={base_image}"

				        if local_image_exists(base_image):

				            pull_flag = "--pull=false"

				            return base_image_arg, final_base_image_arg, pull_flag

				        logger.info(

				            "[INFO] Local image not found: %s, will try to pull from remote", base_image

				        )

				        return base_image_arg, final_base_image_arg, ""

				    def _generate_docker_build_cmd(

				        self,

				        inputs: VllmBuildParameters,

				    ) -> str:

				        base_image_arg, final_base_image_arg, pull_flag = self._get_base_image_args(

				            inputs

				        )

				        torch_arg = self._get_torch_wheel_path_arg(inputs.torch_whls_path)

				        return textwrap.dedent(

				            f"""

				            docker buildx build \

				                --output type=local,dest={inputs.output_dir} \

				                -f docker/Dockerfile.nightly_torch \

				                {pull_flag} \

				                {torch_arg} \

				                {base_image_arg} \

				                {final_base_image_arg} \

				                --build-arg max_jobs={inputs.max_jobs} \

				                --build-arg CUDA_VERSION={inputs.cuda_version} \

				                --build-arg PYTHON_VERSION={inputs.python_version} \

				                --build-arg USE_SCCACHE={int(bool(inputs.sccache_bucket and inputs.sccache_region))} \

				                --build-arg SCCACHE_BUCKET_NAME={inputs.sccache_bucket} \

				                --build-arg SCCACHE_REGION_NAME={inputs.sccache_region} \

				                --build-arg torch_cuda_arch_list='{inputs.torch_cuda_arch_list}' \

				                --target {inputs.target_stage} \

				                -t {inputs.tag_name} \

				                --progress=plain .

				        """

				        ).strip()
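As an illustration of how `_generate_docker_build_cmd` combines optional flags, here is a minimal sketch. It is not the code above: it joins only the non-empty pieces, so an empty `pull_flag` leaves no gap in the command. The helper name `build_docker_cmd` and the `export-wheels` stage name in the usage line are hypothetical.

```python
import shlex
from typing import Mapping, Optional


def build_docker_cmd(
    output_dir: str,
    tag_name: str,
    target_stage: str,
    pull_flag: str = "",
    base_image: Optional[str] = None,
    build_args: Optional[Mapping[str, str]] = None,
) -> str:
    """Join only the non-empty pieces of a `docker buildx build` command."""
    parts = [
        "docker buildx build",
        f"--output type=local,dest={shlex.quote(output_dir)}",
        "-f docker/Dockerfile.nightly_torch",
        pull_flag,  # may be "", filtered out below
    ]
    if base_image:
        # Same image for both the build stage and the final stage.
        parts.append(f"--build-arg BUILD_BASE_IMAGE={shlex.quote(base_image)}")
        parts.append(f"--build-arg FINAL_BASE_IMAGE={shlex.quote(base_image)}")
    for key, value in (build_args or {}).items():
        parts.append(f"--build-arg {key}={shlex.quote(str(value))}")
    parts += [
        f"--target {shlex.quote(target_stage)}",
        f"-t {shlex.quote(tag_name)}",
        "--progress=plain .",
    ]
    return " ".join(p for p in parts if p)


# Hypothetical values, for illustration only.
print(build_docker_cmd("out", "vllm:test", "export-wheels",
                       build_args={"CUDA_VERSION": "12.8.1"}))
```

Quoting each value with `shlex.quote` keeps values that contain spaces, such as an arch list, intact when the string is passed to a shell.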

									
.ci/lumen_cli/cli/lib/core/vllm/vllm_test.py
												
				@ -0,0 +1,269 @@

				import logging

				import os

				import re

				import subprocess

				import sys

				from collections.abc import Iterable

				from dataclasses import dataclass

				from enum import Enum

				from pathlib import Path

				from typing import Any

				from cli.lib.common.cli_helper import BaseRunner

				from cli.lib.common.envs_helper import env_path_field, env_str_field, get_env

				from cli.lib.common.path_helper import copy, remove_dir

				from cli.lib.common.pip_helper import (

				    pip_install_first_match,

				    pip_install_packages,

				    pkg_exists,

				    run_python,

				)

				from cli.lib.common.utils import run_command, working_directory

				from cli.lib.core.vllm.lib import clone_vllm, run_test_plan, sample_vllm_test_library

				logger = logging.getLogger(__name__)

				@dataclass

				class VllmTestParameters:

				    """

				    Parameters defining the vllm external test input

				    !!!DO NOT ADD SECRETS IN THIS CLASS!!!

				    You can put an environment variable name in VllmTestParameters if the name is not itself the secret;

				    fetch secrets directly from env variables at runtime.

				    """

				    torch_whls_path: Path = env_path_field("WHEELS_PATH", "./dist")

				    vllm_whls_path: Path = env_path_field(

				        "VLLM_WHEELS_PATH", "./dist/external/vllm/wheels"

				    )

				    torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")

				    def __post_init__(self):

				        if not self.torch_whls_path.exists():

				            raise ValueError("missing torch_whls_path")

				        if not self.vllm_whls_path.exists():

				            raise ValueError("missing vllm_whls_path")

				class TestInpuType(Enum):

				    TEST_PLAN = "test_plan"

				    UNKNOWN = "unknown"

				class VllmTestRunner(BaseRunner):

				    def __init__(self, args: Any):

				        self.work_directory = "vllm"

				        self.test_plan = ""

				        self.test_type = TestInpuType.UNKNOWN

				        self.shard_id = args.shard_id

				        self.num_shards = args.num_shards

				        if args.test_plan:

				            self.test_plan = args.test_plan

				            self.test_type = TestInpuType.TEST_PLAN

				        # Matches the structure of the artifacts.zip from the torch build

				        self.TORCH_WHL_PATH_REGEX = "torch*.whl"

				        self.TORCH_WHL_EXTRA = "opt-einsum"

				        self.TORCH_ADDITIONAL_WHLS_REGEX = [

				            "vision/torchvision*.whl",

				            "audio/torchaudio*.whl",

				        ]

				        # Match the structure of the artifacts.zip from vllm external build

				        self.VLLM_TEST_WHLS_REGEX = [

				            "xformers/*.whl",

				            "vllm/vllm*.whl",

				            "flashinfer-python/flashinfer*.whl",

				        ]

				    def prepare(self):

				        """

				        Prepare the test environment for vllm: clone the vllm repo, install all wheels and test dependencies, and set env vars.

				        """

				        params = VllmTestParameters()

				        logger.info("Display VllmTestParameters %s", params)

				        self._set_envs(params)

				        clone_vllm(dst=self.work_directory)

				        with working_directory(self.work_directory):

				            remove_dir(Path("vllm"))

				            self._install_wheels(params)

				            self._install_dependencies()

				        # verify the torch packages are not overridden by test dependencies

				        check_versions()

				    def run(self):

				        """

				        main function to run vllm test

				        """

				        self.prepare()

				        try:

				            with working_directory(self.work_directory):

				                if self.test_type == TestInpuType.TEST_PLAN:

				                    if self.num_shards > 1:

				                        run_test_plan(

				                            self.test_plan,

				                            "vllm",

				                            sample_vllm_test_library(),

				                            self.shard_id,

				                            self.num_shards,

				                        )

				                    else:

				                        run_test_plan(

				                            self.test_plan, "vllm", sample_vllm_test_library()

				                        )

				                else:

				                    raise ValueError(f"Unknown test type {self.test_type}")

				        finally:

				            # double check the torch packages are not overridden by other packages

				            check_versions()

				    def _install_wheels(self, params: VllmTestParameters):

				        logger.info("Running vllm test with inputs: %s", params)

				        if not pkg_exists("torch"):

				            # install torch from local whls if it's not installed yet.

				            torch_p = f"{str(params.torch_whls_path)}/{self.TORCH_WHL_PATH_REGEX}"

				            pip_install_first_match(torch_p, self.TORCH_WHL_EXTRA)

				        torch_whls_path = [

				            f"{str(params.torch_whls_path)}/{whl_path}"

				            for whl_path in self.TORCH_ADDITIONAL_WHLS_REGEX

				        ]

				        for torch_whl in torch_whls_path:

				            pip_install_first_match(torch_whl)

				        logger.info("Done. Installed torch and other torch-related wheels ")

				        logger.info("Installing vllm wheels")

				        vllm_whls_path = [

				            f"{str(params.vllm_whls_path)}/{whl_path}"

				            for whl_path in self.VLLM_TEST_WHLS_REGEX

				        ]

				        for vllm_whl in vllm_whls_path:

				            pip_install_first_match(vllm_whl)

				        logger.info("Done. Installed vllm wheels")

				    def _install_test_dependencies(self):

				        """

				        This method replaces torch dependencies with local torch wheel info in

				        the requirements/test.in file from the vllm repo, then generates test.txt

				        at runtime.

				        """

				        logger.info("generate test.txt from requirements/test.in with local torch whls")

				        preprocess_test_in()

				        copy("requirements/test.txt", "snapshot_constraint.txt")

				        run_command(

				            f"{sys.executable} -m uv pip compile requirements/test.in "

				            "-o test.txt "

				            "--index-strategy unsafe-best-match "

				            "--constraint snapshot_constraint.txt "

				            "--torch-backend cu128"

				        )

				        pip_install_packages(requirements="test.txt", prefer_uv=True)

				        logger.info("Done. installed requirements for test dependencies")

				    def _install_dependencies(self):

				        pip_install_packages(packages=["-e", "tests/vllm_test_utils"], prefer_uv=True)

				        pip_install_packages(packages=["hf_transfer"], prefer_uv=True)

				        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

				        # using script from vllm repo to remove all torch packages from requirements txt

				        run_python("use_existing_torch.py")

				        # install common packages

				        for requirements in ["requirements/common.txt", "requirements/build.txt"]:

				            pip_install_packages(

				                requirements=requirements,

				                prefer_uv=True,

				            )

				        # install test packages

				        self._install_test_dependencies()

				    def _set_envs(self, inputs: VllmTestParameters):

				        os.environ["TORCH_CUDA_ARCH_LIST"] = inputs.torch_cuda_arch_list

				        if not validate_cuda(get_env("TORCH_CUDA_ARCH_LIST")):

				            logger.warning(

				                "Unsupported TORCH_CUDA_ARCH_LIST value. "

				                "TORCH_CUDA_ARCH_LIST currently supports "

				                "arch values [8.0, 8.9, 9.0]"

				            )

				        os.environ["HF_TOKEN"] = os.getenv("VLLM_TEST_HUGGING_FACE_TOKEN", "")

				        if not get_env("HF_TOKEN"):

				            raise ValueError(

				                "missing required HF_TOKEN, please set VLLM_TEST_HUGGING_FACE_TOKEN env var"

				            )

				        if not get_env("TORCH_CUDA_ARCH_LIST"):

				            raise ValueError(

				                "missing required TORCH_CUDA_ARCH_LIST, please set TORCH_CUDA_ARCH_LIST env var"

				            )

				def preprocess_test_in(

				    target_file: str = "requirements/test.in", additional_packages: Iterable[str] = ()

				):

				    """

				    Modify target_file in place in the vllm work directory.

				    It removes torch and other unwanted packages from target_file and replaces them with

				    the local torch wheels, using the format "$WHEEL_PACKAGE_NAME @ file://<LOCAL_PATH>"

				    """

				    additional_package_to_move = list(additional_packages or ())

				    pkgs_to_remove = [

				        "torch",

				        "torchvision",

				        "torchaudio",

				        "xformers",

				        "mamba_ssm",

				    ] + additional_package_to_move

				    # Read current requirements

				    target_path = Path(target_file)

				    lines = target_path.read_text().splitlines()

				    pkgs_to_add = []

				    # Remove lines starting with the package names (==, @, >=) — case-insensitive

				    pattern = re.compile(rf"^({'|'.join(pkgs_to_remove)})\s*(==|@|>=)", re.IGNORECASE)

				    kept_lines = [line for line in lines if not pattern.match(line)]

				    # Get local installed torch/vision/audio from pip freeze

				    # This is hacky, but it works

				    pip_freeze = subprocess.check_output(["pip", "freeze"], text=True)

				    header_lines = [

				        line

				        for line in pip_freeze.splitlines()

				        if re.match(

				            r"^(torch|torchvision|torchaudio)\s*@\s*file://", line, re.IGNORECASE

				        )

				    ]

				    # Write back: header_lines + blank + kept_lines

				    out_lines = header_lines + [""] + kept_lines

				    if pkgs_to_add:

				        out_lines += [""] + pkgs_to_add

				    out = "\n".join(out_lines) + "\n"

				    target_path.write_text(out)

				    logger.info("[INFO] Updated %s", target_file)

				def validate_cuda(value: str) -> bool:

				    VALID_VALUES = {"8.0", "8.9", "9.0"}

				    return all(v in VALID_VALUES for v in value.split())

				def check_versions():

				    """

				    check installed packages version

				    """

				    logger.info("Double check installed packages")

				    patterns = ["torch", "xformers", "torchvision", "torchaudio", "vllm"]

				    for pkg in patterns:

				        pkg_exists(pkg)

				    logger.info("Done. checked installed packages")
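The line-filtering step in `preprocess_test_in` can be sketched in isolation. This is a simplified illustration under assumptions, not the function above: `filter_requirements` is a hypothetical name, the sample requirement lines are made up, and `re.escape` is added here so that package names containing regex metacharacters are matched literally.

```python
import re
from typing import Iterable, List


def filter_requirements(lines: Iterable[str], pkgs_to_remove: Iterable[str]) -> List[str]:
    """Drop lines that pin one of the given packages with ==, @ or >=."""
    alternation = "|".join(re.escape(p) for p in pkgs_to_remove)
    # Anchored at line start; a specifier must follow the name directly
    # (optionally after whitespace), so "torch-geometric" is not treated as "torch".
    pattern = re.compile(rf"^({alternation})\s*(==|@|>=)", re.IGNORECASE)
    return [line for line in lines if not pattern.match(line)]


# Made-up sample input, for illustration only.
sample = [
    "torch==2.8.0",
    "torchvision >= 0.23",
    "Torchaudio @ file:///tmp/torchaudio.whl",
    "torch-geometric==2.6",
    "numpy",
    "xformers",
]
kept = filter_requirements(sample, ["torch", "torchvision", "torchaudio", "xformers"])
print(kept)
```

A bare name without a version specifier, such as the `xformers` line in the sample, is kept, because the pattern requires one of the three operators after the package name.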

									
.ci/lumen_cli/cli/run.py
												
				@ -0,0 +1,40 @@

				# cli/run.py

				import argparse

				import logging

				from cli.build_cli.register_build import register_build_commands

				from cli.lib.common.logger import setup_logging

				from cli.test_cli.register_test import register_test_commands

				logger = logging.getLogger(__name__)

				def main():

				    # Define top-level parser

				    parser = argparse.ArgumentParser(description="Lumos CLI")

				    subparsers = parser.add_subparsers(dest="command", required=True)

				    parser.add_argument(

				        "--log-level", default="INFO", help="Log level (DEBUG, INFO, WARNING, ERROR)"

				    )

				    # registers second-level subcommands

				    register_build_commands(subparsers)

				    register_test_commands(subparsers)

				    # parse args after all options are registered

				    args = parser.parse_args()

				    # setup global logging

				    setup_logging(getattr(logging, args.log_level.upper(), logging.INFO))

				    logger.debug("Parsed args: %s", args)

				    if hasattr(args, "func"):

				        args.func(args)

				    else:

				        parser.print_help()

				if __name__ == "__main__":

				    main()
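The dispatch pattern used by `main()` (each subcommand stores a callable under `func`, and the entry point calls it after parsing) can be tried in isolation with the sketch below. The `hello` subcommand and the `dispatch` helper are hypothetical stand-ins and are not part of the CLI above.

```python
import argparse
from typing import List, Optional


def build_demo_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--log-level", default="INFO")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each subcommand attaches its handler via set_defaults(func=...).
    hello = subparsers.add_parser("hello")
    hello.add_argument("--name", default="world")
    hello.set_defaults(func=lambda args: f"hello {args.name}")
    return parser


def dispatch(argv: List[str]) -> Optional[str]:
    parser = build_demo_parser()
    args = parser.parse_args(argv)
    if hasattr(args, "func"):
        return args.func(args)
    # No handler registered for this command: fall back to help text.
    parser.print_help()
    return None


print(dispatch(["hello", "--name", "vllm"]))
```

Top-level options such as `--log-level` must come before the subcommand name on the command line, because they belong to the top-level parser.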

test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_complex → .ci/lumen_cli/cli/test_cli/__init__.py

									
.ci/lumen_cli/cli/test_cli/register_test.py
												
				@ -0,0 +1,62 @@

				import argparse

				import logging

				from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec

				from cli.lib.core.vllm.vllm_test import VllmTestRunner

				logger = logging.getLogger(__name__)

				# Maps targets to their argparse configuration and runner

				# adds a new target to the path `python -m cli.run test external {target}` with its test runner

				_TARGETS: dict[str, TargetSpec] = {

				    "vllm": {

				        "runner": VllmTestRunner,

				        "help": "test vLLM with pytorch main",

				    }

				    # add yours ...

				}

				def common_args(parser: argparse.ArgumentParser) -> None:

				    """

				    Add common CLI arguments to the given parser.

				    """

				    parser.add_argument(

				        "--shard-id",

				        type=int,

				        default=1,

				        help="a single shard id to run (an integer), e.g. '1'",

				    )

				    parser.add_argument(

				        "--num-shards",

				        type=int,

				        default=1,

				        help="a number of shards to run, e.g. '4'",

				    )

				    group = parser.add_mutually_exclusive_group(required=True)

				    group.add_argument(

				        "-tp",

				        "--test-plan",

				        type=str,

				        help="a pre-defined test plan to run, e.g. 'basic_correctness_test'",

				    )

				def register_test_commands(subparsers: argparse._SubParsersAction) -> None:

				    build_parser = subparsers.add_parser(

				        "test",

				        help="test related commands",

				        formatter_class=RichHelp,

				    )

				    build_subparsers = build_parser.add_subparsers(dest="test_command", required=True)

				    overview = "\n".join(

				        f"  {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()

				    )

				    external_parser = build_subparsers.add_parser(

				        "external",

				        help="Test external targets",

				        description="Test third-party targets.\n\nAvailable targets:\n" + overview,

				        formatter_class=RichHelp,

				    )

				    register_targets(external_parser, _TARGETS, common_args=common_args)
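To see how the arguments declared in `common_args` behave, the sketch below rebuilds just those three options on a standalone parser. The parser name `test-external-demo` and the helper names are hypothetical. Because the mutually exclusive group is marked `required=True` and holds a single option, `--test-plan` is effectively mandatory.

```python
import argparse
from typing import List


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="test-external-demo")
    parser.add_argument("--shard-id", type=int, default=1)
    parser.add_argument("--num-shards", type=int, default=1)
    # A required group with one member: argparse exits if it is missing.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-tp", "--test-plan", type=str)
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    return make_parser().parse_args(argv)


ns = parse(["-tp", "basic_correctness_test", "--shard-id", "2", "--num-shards", "4"])
print(ns.test_plan, ns.shard_id, ns.num_shards)
```

Calling `parse([])` raises `SystemExit` after argparse prints a usage error, since no test plan was supplied.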

									
.ci/lumen_cli/pyproject.toml
												
				@ -0,0 +1,23 @@

				[project]

				name = "lumen-ci"

				version = "0.1.0"

				dependencies = [

				    "pyyaml==6.0.2",

				    "GitPython==3.1.45",

				    "docker==7.1.0",

				    "pytest==7.3.2",

				    "uv==0.8.6"

				]

				[tool.setuptools]

				packages = ["cli"]

				[tool.setuptools.package-dir]

				cli = "cli"

				[tool.ruff.lint]

				# Enable preview mode for linting

				preview = true

				# Now you can select your preview rules, like RUF048

				extend-select = ["RUF048"]

									
.ci/lumen_cli/tests/test_app.py
												
				@ -0,0 +1,47 @@

				# tests/test_app.py

				import io

				import sys

				import unittest

				from contextlib import redirect_stderr, redirect_stdout

				from unittest.mock import patch

				from cli.run import main

				class TestArgparseCLI(unittest.TestCase):

				    @patch("cli.build_cli.register_build.VllmBuildRunner.run", return_value=None)

				    @patch("cli.build_cli.register_build.VllmBuildRunner.__init__", return_value=None)

				    def test_cli_run_build_external(self, mock_init, mock_run):

				        from cli.run import main  # import after patches if needed

				        test_args = ["cli.run", "build", "external", "vllm"]

				        with patch.object(sys, "argv", test_args):

				            # argparse may call sys.exit on error; capture to avoid test aborts

				            try:

				                main()

				            except SystemExit:

				                pass

				        mock_init.assert_called_once()  # got constructed

				        mock_run.assert_called_once_with()  # run() called

				    def test_build_help(self):

				        test_args = ["cli.run", "build", "--help"]

				        with patch.object(sys, "argv", test_args):

				            stdout = io.StringIO()

				            stderr = io.StringIO()

				            # --help always raises SystemExit(0)

				            with self.assertRaises(SystemExit) as cm:

				                with redirect_stdout(stdout), redirect_stderr(stderr):

				                    main()

				            self.assertEqual(cm.exception.code, 0)

				            output = stdout.getvalue()

				            self.assertIn("usage", output)

				            self.assertIn("external", output)

				if __name__ == "__main__":

				    unittest.main()

									
.ci/lumen_cli/tests/test_cli_helper.py
												
				@ -0,0 +1,115 @@

				import argparse

				import io

				import unittest

				from contextlib import redirect_stderr

				from unittest.mock import patch

				from cli.lib.common.cli_helper import BaseRunner, register_targets, RichHelp, TargetSpec

				# ---- Dummy runners for unittests----

				class FooRunner(BaseRunner):

				    """Foo description from docstring."""

				    def run(self) -> None:  # replaced by mock

				        pass

				class BarRunner(BaseRunner):

				    def run(self) -> None:  # replaced by mock

				        pass

				def add_foo_args(p: argparse.ArgumentParser) -> None:

				    p.add_argument("--x", type=int, required=True, help="x value")

				def common_args(p: argparse.ArgumentParser) -> None:

				    p.add_argument("--verbose", action="store_true", help="verbose flag")

				def build_parser(specs: dict[str, TargetSpec]) -> argparse.ArgumentParser:

				    parser = argparse.ArgumentParser(prog="app", formatter_class=RichHelp)

				    register_targets(

				        parser=parser,

				        target_specs=specs,

				        common_args=common_args,

				    )

				    return parser

				def get_subparser(

				    parser: argparse.ArgumentParser, name: str

				) -> argparse.ArgumentParser:

				    subparsers_action = next(

				        a

				        for a in parser._subparsers._group_actions  # type: ignore[attr-defined]

				        if isinstance(a, argparse._SubParsersAction)

				    )

				    return subparsers_action.choices[name]

				class TestRegisterTargets(unittest.TestCase):

				    def test_metavar_lists_targets(self):

				        specs: dict[str, TargetSpec] = {

				            "foo": {"runner": FooRunner, "add_arguments": add_foo_args},

				            "bar": {"runner": BarRunner},

				        }

				        parser = build_parser(specs)

				        subparsers_action = next(

				            a

				            for a in parser._subparsers._group_actions  # type: ignore[attr-defined]

				            if isinstance(a, argparse._SubParsersAction)

				        )

				        self.assertEqual(subparsers_action.metavar, "{foo,bar}")

				    def test_add_arguments_and_common_args_present(self):

				        specs: dict[str, TargetSpec] = {

				            "foo": {"runner": FooRunner, "add_arguments": add_foo_args},

				        }

				        parser = build_parser(specs)

				        foo = get_subparser(parser, "foo")

				        help_text = foo.format_help()

				        self.assertIn("--x", help_text)

				        self.assertIn("--verbose", help_text)

				    def test_runner_constructed_with_ns_and_run_called(self):

				        specs: dict[str, TargetSpec] = {

				            "foo": {"runner": FooRunner, "add_arguments": add_foo_args},

				        }

				        parser = build_parser(specs)

				        with (

				            patch.object(FooRunner, "__init__", return_value=None) as mock_init,

				            patch.object(FooRunner, "run", return_value=None) as mock_run,

				        ):

				            ns = parser.parse_args(["foo", "--x", "3", "--verbose"])

				            ns.func(ns)  # set by register_targets

				            # __init__ received the Namespace

				            self.assertEqual(mock_init.call_count, 1)

				            (called_ns,), _ = mock_init.call_args

				            self.assertIsInstance(called_ns, argparse.Namespace)

				            # run() called with no args

				            mock_run.assert_called_once_with()

				    def test_runner_docstring_used_as_description_when_missing(self):

				        specs: dict[str, TargetSpec] = {

				            "foo": {"runner": FooRunner, "add_arguments": add_foo_args},

				        }

				        parser = build_parser(specs)

				        foo = get_subparser(parser, "foo")

				        help_text = foo.format_help()

				        self.assertIn("Foo description from docstring.", help_text)

				    def test_missing_target_raises_systemexit_with_usage(self):

				        specs: dict[str, TargetSpec] = {"foo": {"runner": FooRunner}}

				        parser = build_parser(specs)

				        buf = io.StringIO()

				        with self.assertRaises(SystemExit), redirect_stderr(buf):

				            parser.parse_args([])

				        err = buf.getvalue()

				        self.assertIn("usage:", err)

				if __name__ == "__main__":

				    unittest.main()
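The tests above pin down the observable behavior of `register_targets`: a `{foo,bar}`-style metavar, per-target and common arguments, the runner docstring as the description, and a `func` that builds the runner from the parsed namespace and calls `run()`. The real implementation is not shown in this listing, so the following is only a hypothetical sketch of a function with that behavior. `register_targets_sketch` and `EchoRunner` are invented names.

```python
import argparse
from typing import Any, Callable, Dict, List, Optional


class EchoRunner:
    """Hypothetical runner that doubles --x."""

    def __init__(self, args: argparse.Namespace) -> None:
        self.args = args

    def run(self) -> int:
        return self.args.x * 2


def register_targets_sketch(
    parser: argparse.ArgumentParser,
    specs: Dict[str, Dict[str, Any]],
    common_args: Optional[Callable[[argparse.ArgumentParser], None]] = None,
) -> None:
    metavar = "{" + ",".join(specs) + "}"
    subparsers = parser.add_subparsers(dest="target", required=True, metavar=metavar)
    for name, spec in specs.items():
        runner_cls = spec["runner"]
        # Fall back to the runner's docstring when no description is given.
        description = spec.get("description") or (runner_cls.__doc__ or "")
        sub = subparsers.add_parser(name, help=spec.get("help", ""), description=description)
        if common_args is not None:
            common_args(sub)
        if "add_arguments" in spec:
            spec["add_arguments"](sub)
        # Bind the class as a default argument so each target keeps its own runner.
        sub.set_defaults(func=lambda ns, cls=runner_cls: cls(ns).run())


def demo(argv: List[str]) -> int:
    parser = argparse.ArgumentParser(prog="app")
    specs = {
        "echo": {
            "runner": EchoRunner,
            "add_arguments": lambda p: p.add_argument("--x", type=int, required=True),
        }
    }
    register_targets_sketch(parser, specs)
    ns = parser.parse_args(argv)
    return ns.func(ns)


print(demo(["echo", "--x", "3"]))
```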

									
.ci/lumen_cli/tests/test_docker_helper.py
												
				@ -0,0 +1,75 @@

				import unittest

				from unittest import mock

				from unittest.mock import MagicMock

				import docker.errors as derr

				from cli.lib.common.docker_helper import _get_client, local_image_exists

				class TestDockerImageHelpers(unittest.TestCase):

				    def setUp(self):

				        # Reset the singleton in the target module

				        patcher = mock.patch("cli.lib.common.docker_helper._docker_client", None)

				        self.addCleanup(patcher.stop)

				        patcher.start()

				    def test_local_image_exists_true(self):

				        # Mock a docker client whose images.get returns an object (no exception)

				        mock_client = MagicMock()

				        mock_client.images.get.return_value = object()

				        ok = local_image_exists("repo:tag", client=mock_client)

				        self.assertTrue(ok)

				    def test_local_image_exists_not_found_false(self):

				        mock_client = MagicMock()

				        # Raise docker.errors.NotFound

				        mock_client.images.get.side_effect = derr.NotFound("nope")

				        ok = local_image_exists("missing:latest", client=mock_client)

				        self.assertFalse(ok)

				    def test_local_image_exists_api_error_false(self):

				        mock_client = MagicMock()

				        mock_client.images.get.side_effect = derr.APIError("boom", None)

				        ok = local_image_exists("broken:tag", client=mock_client)

				        self.assertFalse(ok)

				    def test_local_image_exists_uses_lazy_singleton(self):

				        # Patch docker.from_env used by _get_client()

				        with mock.patch(

				            "cli.lib.common.docker_helper.docker.from_env"

				        ) as mock_from_env:

				            mock_docker_client = MagicMock()

				            mock_from_env.return_value = mock_docker_client

				            # First call should create and cache the client

				            c1 = _get_client()

				            self.assertIs(c1, mock_docker_client)

				            mock_from_env.assert_called_once()

				            # Second call should reuse cached client (no extra from_env calls)

				            c2 = _get_client()

				            self.assertIs(c2, mock_docker_client)

				            mock_from_env.assert_called_once()  # still once

				    def test_local_image_exists_without_client_param_calls_get_client_once(self):

				        # Ensure _get_client is called and cached; local_image_exists should reuse it

				        with mock.patch("cli.lib.common.docker_helper._get_client") as mock_get_client:

				            mock_client = MagicMock()

				            mock_get_client.return_value = mock_client

				            # 1st call

				            local_image_exists("repo:tag")

				            # 2nd call

				            local_image_exists("repo:tag2")

				            # local_image_exists should call _get_client each time,

				            # but your _get_client itself caches docker.from_env.

				            self.assertEqual(mock_get_client.call_count, 2)

				            self.assertEqual(mock_client.images.get.call_count, 2)

				            mock_client.images.get.assert_any_call("repo:tag")

				            mock_client.images.get.assert_any_call("repo:tag2")

				if __name__ == "__main__":

				    unittest.main()
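The lazy-singleton behavior these tests check (the client factory runs once, later calls reuse the cached client, and a lookup error maps to `False`) can be sketched without the Docker SDK. Everything below is a stand-in written under assumptions: `get_client`, `image_exists`, `FakeClient`, and `ImageNotFound` are hypothetical names, and the actual helper is assumed, based on the tests, to use `docker.from_env` as its factory.

```python
from typing import Any, Callable, Iterable, Optional

_client: Optional[Any] = None  # module-level cache, empty until first use


def get_client(factory: Callable[[], Any]) -> Any:
    """Create the client on first call, then return the cached instance."""
    global _client
    if _client is None:
        _client = factory()
    return _client


class ImageNotFound(Exception):
    """Stand-in for an SDK 'not found' error (hypothetical)."""


class FakeImages:
    def __init__(self, known: Iterable[str]) -> None:
        self._known = set(known)

    def get(self, name: str) -> str:
        if name not in self._known:
            raise ImageNotFound(name)
        return name


class FakeClient:
    """Stand-in exposing the `client.images.get(name)` shape used above."""

    def __init__(self, known: Iterable[str]) -> None:
        self.images = FakeImages(known)


def image_exists(name: str, client: Any) -> bool:
    """Return True if the lookup succeeds, False if it raises ImageNotFound."""
    try:
        client.images.get(name)
        return True
    except ImageNotFound:
        return False


client = get_client(lambda: FakeClient(["repo:tag"]))
print(image_exists("repo:tag", client), image_exists("missing:latest", client))
```

The tests reset the cached client in `setUp` for the same reason this sketch keeps the cache in a module-level variable: state that outlives a single call has to be cleared between test cases.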

									
.ci/lumen_cli/tests/test_envs_helper.py
												
				@ -0,0 +1,149 @@

				import os

				import unittest

				from dataclasses import dataclass

				from pathlib import Path

				from unittest.mock import patch

				import cli.lib.common.envs_helper as m

				class TestEnvHelpers(unittest.TestCase):

				    def setUp(self):

				        # Keep a copy of the original environment to restore later

				        self._env_backup = dict(os.environ)

				    def tearDown(self):

				        # Restore environment to original state

				        os.environ.clear()

				        os.environ.update(self._env_backup)

				    # -------- get_env --------

				    def test_get_env_unset_returns_default(self):

				        with patch.dict(os.environ, {}, clear=True):

				            self.assertEqual(m.get_env("FOO", "default"), "default")

				    def test_get_env_empty_returns_default(self):

				        with patch.dict(os.environ, {"FOO": ""}, clear=True):

				            self.assertEqual(m.get_env("FOO", "default"), "default")

				    def test_get_env_set_returns_value(self):

				        with patch.dict(os.environ, {"FOO": "bar"}, clear=True):

				            self.assertEqual(m.get_env("FOO", "default"), "bar")

				    def test_get_env_not_exist_returns_default(self):

				        with patch.dict(os.environ, {"FOO": "bar"}, clear=True):

				            self.assertEqual(m.get_env("TEST_NOT_EXIST", "default"), "default")

				    def test_get_env_not_exist_without_default(self):

				        with patch.dict(os.environ, {"FOO": "bar"}, clear=True):

				            self.assertEqual(m.get_env("TEST_NOT_EXIST"), "")

				    # -------- env_bool --------

				    def test_env_bool_uses_default_when_unset(self):

				        with patch.dict(os.environ, {}, clear=True):

				            self.assertTrue(m.env_bool("FLAG", default=True))

				            self.assertFalse(m.env_bool("FLAG", default=False))

				    def test_env_bool_uses_str2bool_when_set(self):

				        # Patch str2bool used by env_bool so we don't depend on its exact behavior

				        def fake_str2bool(s: str) -> bool:

				            return s.lower() in {"1", "true", "yes", "on", "y"}

				        with (

				            patch.dict(os.environ, {"FLAG": "yEs"}, clear=True),

				            patch.object(m, "str2bool", fake_str2bool),

				        ):

				            self.assertTrue(m.env_bool("FLAG", default=False))

				    # -------- env_path_optional / env_path --------

				    def test_env_path_optional_unset_returns_none_by_default(self):

				        with patch.dict(os.environ, {}, clear=True):

				            self.assertIsNone(m.env_path_optional("P"))

				    def test_env_path_optional_unset_returns_none_when_env_var_is_empty(self):

				        with patch.dict(os.environ, {"P": ""}, clear=True):

				            self.assertIsNone(m.env_path_optional("P"))

				    def test_env_path_optional_unset_returns_default_str(self):

				        # default as string; resolve=True by default -> absolute path

				        default_str = "x/y"

				        with patch.dict(os.environ, {}, clear=True):

				            p = m.env_path_optional("P", default=default_str)

				            self.assertIsInstance(p, Path)

				            self.assertIsNotNone(p)

				            if p:

				                self.assertTrue(p.is_absolute())

				                self.assertEqual(p.parts[-2:], ("x", "y"))

				    def test_env_path_optional_unset_returns_default_path_no_resolve(self):

				        d = Path("z")

				        with patch.dict(os.environ, {}, clear=True):

				            p = m.env_path_optional("P", default=d, resolve=False)

				            self.assertEqual(p, d)

				    def test_env_path_optional_respects_resolve_true(self):

				        with patch.dict(os.environ, {"P": "a/b"}, clear=True):

				            p = m.env_path_optional("P", resolve=True)

				            self.assertIsInstance(p, Path)

				            if p:

				                self.assertTrue(p.is_absolute())

				    def test_env_path_optional_respects_resolve_false(self):

				        with patch.dict(os.environ, {"P": "rel/dir"}, clear=True):

				            p = m.env_path_optional("P", resolve=False)

				            self.assertEqual(p, Path("rel/dir"))

				            if p:

				                self.assertFalse(p.is_absolute())

				    def test_env_path_raises_when_missing_and_default_none(self):

				        with patch.dict(os.environ, {}, clear=True):

				            with self.assertRaises(ValueError):

				                m.env_path("P", None, resolve=True)

				    def test_env_path_returns_path_when_present(self):

				        tmp = Path("./b").resolve()

				        with patch.dict(os.environ, {"P": str(tmp)}, clear=True):

				            p = m.env_path("P", None, resolve=True)

				            self.assertEqual(p, tmp)

				    # -------- dataclass field helpers --------

				    def test_dataclass_fields_read_env_at_instantiation(self):

				        @dataclass

				        class Cfg:

				            flag: bool = m.env_bool_field("FLAG", default=False)

				            out: Path = m.env_path_field("OUT", default="ab", resolve=True)

				            name: str = m.env_str_field("NAME", default="anon")

				        # First instantiation

				        with patch.dict(

				            os.environ, {"FLAG": "true", "OUT": "outdir", "NAME": "alice"}, clear=True

				        ):

				            cfg1 = Cfg()

				            self.assertTrue(cfg1.flag)

				            self.assertIsInstance(cfg1.out, Path)

				            self.assertTrue(cfg1.out.is_absolute())

				            self.assertEqual(cfg1.name, "alice")

				            cfg1.name = "bob"  # change instance value

				            self.assertEqual(cfg1.name, "bob")  # change is reflected

				        # Change env; new instance should reflect new values

				        with patch.dict(os.environ, {"FLAG": "false", "NAME": ""}, clear=True):

				            cfg2 = Cfg()

				            self.assertFalse(cfg2.flag)  # str2bool("false") -> False

				            self.assertIn("ab", str(cfg2.out))

				            self.assertIsInstance(cfg2.out, Path)

				            self.assertTrue(cfg2.out.is_absolute())

				            self.assertEqual(cfg2.name, "anon")  # empty -> fallback to default

				    def test_dataclass_path_field_with_default_value(self):

				        @dataclass

				        class C2:

				            out: Path = m.env_path_field("OUT", default="some/dir", resolve=False)

				        with patch.dict(os.environ, {}, clear=True):

				            c = C2()

				            self.assertEqual(c.out, Path("some/dir"))

				if __name__ == "__main__":

				    unittest.main()
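The `get_env` and `env_bool` tests above pin down one behaviour worth spelling out: an unset variable and an empty one both fall back to the default. Below is a minimal sketch consistent with those assertions. The function bodies are hypothetical stand-ins, not the real `cli.lib.common.envs_helper` code, and the spellings accepted by `str2bool` are an assumption borrowed from the fake used in the test.

```python
import os
from unittest.mock import patch


def get_env(name: str, default: str = "") -> str:
    """Hypothetical stand-in: unset or empty values fall back to the default."""
    return os.environ.get(name) or default


def str2bool(value: str) -> bool:
    """Hypothetical parser; the accepted spellings are an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on", "y"}


def env_bool(name: str, default: bool = False) -> bool:
    """Hypothetical stand-in: parse the variable only when it is set and non-empty."""
    raw = get_env(name)
    return str2bool(raw) if raw else default


# Same isolation technique as the tests: replace the whole environment.
with patch.dict(os.environ, {"FOO": "", "FLAG": "yEs"}, clear=True):
    empty_value = get_env("FOO", "default")       # empty -> default
    missing_value = get_env("MISSING")            # unset, no default -> ""
    flag_value = env_bool("FLAG", default=False)  # parsed by str2bool
    unset_flag = env_bool("OTHER", default=True)  # unset -> default
```

Using `os.environ.get(name) or default` treats `""` the same as a missing key, which is what `test_get_env_empty_returns_default` expects.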

									
										122

.ci/lumen_cli/tests/test_path_helper.py
									
										Normal file
									
				@ -0,0 +1,122 @@

				# test_path_helper.py

				# Run: pytest -q

				import os

				import unittest

				from pathlib import Path

				from tempfile import TemporaryDirectory

				from cli.lib.common.path_helper import (

				    copy,

				    ensure_dir_exists,

				    force_create_dir,

				    get_path,

				    is_path_exist,

				    remove_dir,

				)

				class TestPathHelper(unittest.TestCase):

				    def setUp(self):

				        self.tmpdir = TemporaryDirectory()

				        self.tmp_path = Path(self.tmpdir.name)

				    def tearDown(self):

				        self.tmpdir.cleanup()

				    # -------- get_path --------

				    def test_get_path_returns_path_for_str(self):

				        # Use relative path to avoid absolute-ness

				        rel_str = "sub/f.txt"

				        os.chdir(self.tmp_path)

				        p = get_path(rel_str, resolve=False)

				        self.assertIsInstance(p, Path)

				        self.assertFalse(p.is_absolute())

				        self.assertEqual(str(p), rel_str)

				    def test_get_path_resolves(self):

				        rel_str = "sub/f.txt"

				        p = get_path(str(self.tmp_path / rel_str), resolve=True)

				        self.assertTrue(p.is_absolute())

				        self.assertTrue(str(p).endswith(rel_str))

				    def test_get_path_with_path_input(self):

				        p_in = self.tmp_path / "sub/f.txt"

				        p_out = get_path(p_in, resolve=False)

				        self.assertEqual(str(p_out), str(p_in))

				    def test_get_path_with_none_raises(self):

				        with self.assertRaises(ValueError):

				            get_path(None)  # type: ignore[arg-type]

				    def test_get_path_invalid_type_raises(self):

				        with self.assertRaises(TypeError):

				            get_path(123)  # type: ignore[arg-type]

				    # -------- ensure_dir_exists / force_create_dir / remove_dir --------

				    def test_ensure_dir_exists_creates_and_is_idempotent(self):

				        d = self.tmp_path / "made"

				        ensure_dir_exists(d)

				        self.assertTrue(d.exists() and d.is_dir())

				        ensure_dir_exists(d)

				    def test_force_create_dir_clears_existing(self):

				        d = self.tmp_path / "fresh"

				        (d / "inner").mkdir(parents=True)

				        (d / "inner" / "f.txt").write_text("x")

				        force_create_dir(d)

				        self.assertTrue(d.exists())

				        self.assertEqual(list(d.iterdir()), [])

				    def test_remove_dir_none_is_noop(self):

				        remove_dir(None)  # type: ignore[arg-type]

				    def test_remove_dir_nonexistent_is_noop(self):

				        ghost = self.tmp_path / "ghost"

				        remove_dir(ghost)

				    def test_remove_dir_accepts_str(self):

				        d = self.tmp_path / "to_rm"

				        d.mkdir()

				        remove_dir(str(d))

				        self.assertFalse(d.exists())

				    # -------- copy --------

				    def test_copy_file_to_file(self):

				        src = self.tmp_path / "src.txt"

				        dst = self.tmp_path / "out" / "dst.txt"

				        src.write_text("hello")

				        copy(src, dst)

				        self.assertEqual(dst.read_text(), "hello")

				    def test_copy_dir_to_new_dir(self):

				        src = self.tmp_path / "srcdir"

				        (src / "a").mkdir(parents=True)

				        (src / "a" / "f.txt").write_text("content")

				        dst = self.tmp_path / "destdir"

				        copy(src, dst)

				        self.assertEqual((dst / "a" / "f.txt").read_text(), "content")

				    def test_copy_dir_into_existing_dir_overwrite_true_merges(self):

				        src = self.tmp_path / "srcdir"

				        dst = self.tmp_path / "destdir"

				        (src / "x").mkdir(parents=True)

				        (src / "x" / "new.txt").write_text("new")

				        dst.mkdir()

				        (dst / "existing.txt").write_text("old")

				        copy(src, dst)

				        self.assertEqual((dst / "existing.txt").read_text(), "old")

				        self.assertEqual((dst / "x" / "new.txt").read_text(), "new")

				    def test_is_str_path_exist(self):

				        p = self.tmp_path / "x.txt"

				        p.write_text("1")

				        self.assertTrue(is_path_exist(str(p)))

				        self.assertTrue(is_path_exist(p))

				        self.assertFalse(is_path_exist(str(self.tmp_path / "missing")))

				        self.assertFalse(is_path_exist(self.tmp_path / "missing"))

				        self.assertFalse(is_path_exist(""))

				if __name__ == "__main__":

				    unittest.main()
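The `copy` tests above expect three behaviours: file-to-file copies create missing parent directories, directory copies recurse, and copying into an existing directory merges rather than replaces. Here is a rough sketch of one way to get those semantics with the standard library. `copy_path` is a hypothetical name, and the real `path_helper.copy` may be implemented differently.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def copy_path(src: Path, dst: Path) -> None:
    """Hypothetical helper, not the real `copy` from path_helper."""
    if src.is_dir():
        # dirs_exist_ok=True merges into an existing destination (Python 3.8+).
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        # Create the destination's parent so file-to-file copies never fail on it.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


with TemporaryDirectory() as tmp:
    root = Path(tmp)
    src, dst = root / "srcdir", root / "destdir"
    (src / "x").mkdir(parents=True)
    (src / "x" / "new.txt").write_text("new")
    dst.mkdir()
    (dst / "existing.txt").write_text("old")

    # Directory into existing directory: old files stay, new ones are added.
    copy_path(src, dst)
    merged = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*"))

    # File to file, with a destination parent that does not exist yet.
    copy_path(src / "x" / "new.txt", root / "out" / "dst.txt")
    copied_text = (root / "out" / "dst.txt").read_text()
```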

									
										185

.ci/lumen_cli/tests/test_run_plan.py
									
										Normal file
									
				@ -0,0 +1,185 @@

				# tests/test_run_plan.py

				import importlib

				from contextlib import nullcontext

				from types import SimpleNamespace

				from unittest.mock import MagicMock

				import pytest

				MOD = "cli.lib.core.vllm.lib"

				# We import inside tests so the MOD override above applies everywhere

				run_test_plan_import_path = f"{MOD}.run_test_plan"

				def _get_cmd(c):

				    # Support both kwargs and positional args

				    return c.kwargs.get("cmd", c.args[0] if c.args else None)

				def _get_check(c):

				    if "check" in c.kwargs:

				        return c.kwargs["check"]

				    # If positional, assume second arg is 'check' when present; default False

				    return c.args[1] if len(c.args) > 1 else False

				@pytest.fixture

				def patch_module(monkeypatch):

				    """

				    Patch helpers ('pip_install_packages', 'temp_environ', 'working_directory',

				    'run_command', 'logger') inside the target module and expose them.

				    """

				    module = importlib.import_module(MOD)

				    # Create fakes/mocks

				    pip_install_packages = MagicMock(name="pip_install_packages")

				    run_command = MagicMock(name="run_command", return_value=0)

				    # temp_environ / working_directory: record calls but act as context managers

				    temp_calls: list[dict] = []

				    workdir_calls: list[str] = []

				    def fake_working_directory(path: str):

				        workdir_calls.append(path)

				        return nullcontext()

				    def fake_temp_env(map: dict[str, str]):

				        temp_calls.append(map)

				        return nullcontext()

				    logger = SimpleNamespace(

				        info=MagicMock(name="logger.info"),

				        error=MagicMock(name="logger.error"),

				    )

				    # Apply patches (raise if attribute doesn't exist)

				    monkeypatch.setattr(

				        module, "pip_install_packages", pip_install_packages, raising=True

				    )

				    monkeypatch.setattr(module, "run_command", run_command, raising=True)

				    monkeypatch.setattr(

				        module, "working_directory", fake_working_directory, raising=True

				    )

				    monkeypatch.setattr(module, "temp_environ", fake_temp_env, raising=True)

				    monkeypatch.setattr(module, "logger", logger, raising=True)

				    return SimpleNamespace(

				        module=module,

				        run_test_plan=module.run_test_plan,  # expose to avoid getattr("constant") (Ruff B009)

				        pip_install_packages=pip_install_packages,

				        run_command=run_command,

				        temp_calls=temp_calls,

				        workdir_calls=workdir_calls,

				        logger=logger,

				    )

				def test_success_runs_all_steps_and_uses_env_and_workdir(monkeypatch, patch_module):

				    run_test_plan = patch_module.run_test_plan

				    tests_map = {

				        "basic": {

				            "title": "Basic suite",

				            "package_install": [],

				            "working_directory": "tests",

				            "env_vars": {"GLOBAL_FLAG": "1"},

				            "steps": [

				                "export A=x && pytest -q",

				                "export B=y && pytest -q tests/unit",

				            ],

				        }

				    }

				    # Exit codes for the two steps; the third value is never consumed

				    patch_module.run_command.side_effect = [0, 0, 0]

				    run_test_plan("basic", "cpu", tests_map)

				    calls = patch_module.run_command.call_args_list

				    cmds = [_get_cmd(c) for c in calls]

				    checks = [_get_check(c) for c in calls]

				    assert cmds == [

				        "export A=x && pytest -q",

				        "export B=y && pytest -q tests/unit",

				    ]

				    assert all(chk is False for chk in checks)

				    assert patch_module.workdir_calls == ["tests"]

				    assert patch_module.temp_calls == [{"GLOBAL_FLAG": "1"}]

				def test_installs_packages_when_present(monkeypatch, patch_module):

				    run_test_plan = patch_module.module.run_test_plan

				    tests_map = {

				        "with_pkgs": {

				            "title": "Needs deps",

				            "package_install": ["timm==1.0.0", "flash-attn"],

				            "steps": ["pytest -q"],

				        }

				    }

				    patch_module.run_command.return_value = 0

				    run_test_plan("with_pkgs", "gpu", tests_map)

				    patch_module.pip_install_packages.assert_called_once_with(

				        packages=["timm==1.0.0", "flash-attn"],

				        prefer_uv=True,

				    )

				def test_raises_on_missing_plan(patch_module):

				    run_test_plan = patch_module.module.run_test_plan

				    with pytest.raises(RuntimeError) as ei:

				        run_test_plan("nope", "cpu", tests_map={})

				    assert "test nope not found" in str(ei.value)

				def test_aggregates_failures_and_raises(monkeypatch, patch_module):

				    run_test_plan = patch_module.module.run_test_plan

				    tests_map = {

				        "mix": {

				            "title": "Some pass some fail",

				            "steps": [

				                "pytest test_a.py",  # 0 → pass

				                "pytest test_b.py",  # 1 → fail

				                "pytest test_c.py",  # 2 → fail

				            ],

				        }

				    }

				    # Simulate pass, fail, fail

				    patch_module.run_command.side_effect = [0, 1, 2]

				    with pytest.raises(RuntimeError) as ei:

				        run_test_plan("mix", "cpu", tests_map)

				    msg = str(ei.value)

				    assert "2 pytest runs failed" in msg

				    # Ensure logger captured failed tests list

				    patch_module.logger.error.assert_called_once()

				    # And we attempted all three commands

				    assert patch_module.run_command.call_count == 3

				def test_custom_working_directory_used(patch_module):

				    run_test_plan = patch_module.module.run_test_plan

				    tests_map = {

				        "customwd": {

				            "title": "Custom wd",

				            "working_directory": "examples/ci",

				            "steps": ["pytest -q"],

				        }

				    }

				    patch_module.run_command.return_value = 0

				    run_test_plan("customwd", "cpu", tests_map)

				    assert patch_module.workdir_calls == ["examples/ci"]
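`test_aggregates_failures_and_raises` relies on every step running even after an earlier one fails, with a single error raised at the end. The sketch below shows that aggregation pattern in isolation. `run_steps` and `fake_run` are hypothetical names, and the real `run_test_plan` also handles package installation, environment variables and the working directory, which are left out here.

```python
from typing import Callable


def run_steps(steps: list[str], run: Callable[[str], int]) -> None:
    """Hypothetical aggregator: run every step, then fail once at the end."""
    # The comprehension calls run() for each step, so no step is skipped.
    failures = [step for step in steps if run(step) != 0]
    if failures:
        raise RuntimeError(f"{len(failures)} pytest runs failed: {failures}")


exit_codes = iter([0, 1, 2])  # pass, fail, fail
executed: list[str] = []


def fake_run(cmd: str) -> int:
    """Record the command and return the next canned exit code."""
    executed.append(cmd)
    return next(exit_codes)


try:
    run_steps(["pytest test_a.py", "pytest test_b.py", "pytest test_c.py"], fake_run)
    message = ""
except RuntimeError as exc:
    message = str(exc)
```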

									
										143

.ci/lumen_cli/tests/test_utils.py
									
										Normal file
									
				@ -0,0 +1,143 @@

				import os

				import tempfile

				import unittest

				from pathlib import Path

				from cli.lib.common.utils import temp_environ, working_directory

				class EnvIsolatedTestCase(unittest.TestCase):

				    """Base class that snapshots os.environ and CWD for isolation."""

				    def setUp(self):

				        import os

				        import tempfile

				        self._env_backup = dict(os.environ)

				        # Snapshot/repair CWD if it's gone

				        try:

				            self._cwd_backup = os.getcwd()

				        except FileNotFoundError:

				            # If CWD no longer exists, switch to a safe place and record that

				            self._cwd_backup = tempfile.gettempdir()

				            os.chdir(self._cwd_backup)

				        # Create a temporary directory for the test to run in

				        self._temp_dir = tempfile.mkdtemp()

				        os.chdir(self._temp_dir)

				    def tearDown(self):

				        import os

				        import shutil

				        import tempfile

				        # Restore cwd first (before cleaning up temp dir)

				        try:

				            os.chdir(self._cwd_backup)

				        except OSError:

				            os.chdir(tempfile.gettempdir())

				        # Clean up temporary directory

				        try:

				            shutil.rmtree(self._temp_dir, ignore_errors=True)

				        except Exception:

				            pass  # Ignore cleanup errors

				        # Restore env

				        to_del = set(os.environ.keys()) - set(self._env_backup.keys())

				        for k in to_del:

				            os.environ.pop(k, None)

				        for k, v in self._env_backup.items():

				            os.environ[k] = v

				class TestTempEnviron(EnvIsolatedTestCase):

				    def test_sets_and_restores_new_var(self):

				        var = "TEST_TMP_ENV_NEW"

				        self.assertNotIn(var, os.environ)

				        with temp_environ({var: "123"}):

				            self.assertEqual(os.environ[var], "123")

				        self.assertNotIn(var, os.environ)  # removed after exit

				    def test_overwrites_and_restores_existing_var(self):

				        var = "TEST_TMP_ENV_OVERWRITE"

				        os.environ[var] = "orig"

				        with temp_environ({var: "override"}):

				            self.assertEqual(os.environ[var], "override")

				        self.assertEqual(os.environ[var], "orig")  # restored

				    def test_multiple_vars_and_missing_cleanup(self):

				        v1, v2 = "TEST_ENV_V1", "TEST_ENV_V2"

				        os.environ.pop(v1, None)

				        os.environ[v2] = "keep"

				        with temp_environ({v1: "a", v2: "b"}):

				            self.assertEqual(os.environ[v1], "a")

				            self.assertEqual(os.environ[v2], "b")

				        self.assertNotIn(v1, os.environ)  # newly-added -> removed

				        self.assertEqual(os.environ[v2], "keep")  # pre-existing -> restored

				    def test_restores_even_on_exception(self):

				        var = "TEST_TMP_ENV_EXCEPTION"

				        self.assertNotIn(var, os.environ)

				        with self.assertRaises(RuntimeError):

				            with temp_environ({var: "x"}):

				                self.assertEqual(os.environ[var], "x")

				                raise RuntimeError("boom")

				        self.assertNotIn(var, os.environ)  # removed after exception

				class TestWorkingDirectory(EnvIsolatedTestCase):

				    def test_changes_and_restores(self):

				        start = Path.cwd()

				        with tempfile.TemporaryDirectory() as td:

				            target = Path(td) / "wd"

				            target.mkdir()

				            with working_directory(str(target)):

				                self.assertEqual(Path.cwd().resolve(), target.resolve())

				        self.assertEqual(Path.cwd(), start)

				    def test_noop_when_empty_path(self):

				        start = Path.cwd()

				        with working_directory(""):

				            self.assertEqual(Path.cwd(), start)

				        self.assertEqual(Path.cwd(), start)

				    def test_restores_on_exception(self):

				        start = Path.cwd()

				        with tempfile.TemporaryDirectory() as td:

				            target = Path(td) / "wd_exc"

				            target.mkdir()

				            with self.assertRaises(ValueError):

				                with working_directory(str(target)):

				                    # Normalize both sides to handle /var -> /private/var

				                    self.assertEqual(Path.cwd().resolve(), target.resolve())

				                    raise ValueError("boom")

				        self.assertEqual(Path.cwd().resolve(), start.resolve())

				    def test_raises_for_missing_dir(self):

				        start = Path.cwd()

				        with tempfile.TemporaryDirectory() as td:

				            missing = Path(td) / "does_not_exist"

				            with self.assertRaises(FileNotFoundError):

				                # os.chdir should raise before yielding

				                with working_directory(str(missing)):

				                    pass

				        self.assertEqual(Path.cwd(), start)

				if __name__ == "__main__":

				    unittest.main(verbosity=2)
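The tests above describe `temp_environ` and `working_directory` purely through their observable behaviour: values are restored on exit (even after an exception), an empty path is a no-op, and a missing directory raises before the body runs. Below is a minimal sketch of context managers with those properties. These are hypothetical stand-ins written for illustration, not the actual code in `cli.lib.common.utils`.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from tempfile import TemporaryDirectory


@contextmanager
def temp_environ(updates):
    """Hypothetical stand-in: apply `updates`, then restore prior values on exit."""
    missing = object()  # sentinel for "variable did not exist before"
    saved = {key: os.environ.get(key, missing) for key in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is missing:
                os.environ.pop(key, None)  # newly added -> remove
            else:
                os.environ[key] = old      # pre-existing -> restore


@contextmanager
def working_directory(path):
    """Hypothetical stand-in: chdir into `path`; an empty path is a no-op."""
    if not path:
        yield
        return
    previous = Path.cwd()
    os.chdir(path)  # raises FileNotFoundError before yielding if path is missing
    try:
        yield
    finally:
        os.chdir(previous)


os.environ.pop("SKETCH_VAR", None)
with temp_environ({"SKETCH_VAR": "123"}):
    inside_value = os.environ["SKETCH_VAR"]
restored = "SKETCH_VAR" not in os.environ

start = Path.cwd()
with TemporaryDirectory() as td:
    with working_directory(td):
        # Resolve both sides so symlinked temp dirs (/var -> /private/var) compare equal.
        moved = Path.cwd().resolve() == Path(td).resolve()
back = Path.cwd() == start
```

Putting the restore logic in `finally` is what makes the `test_restores_even_on_exception` style of test pass.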

									
										176

.ci/lumen_cli/tests/test_vllm.py
									
										Normal file
									
				@ -0,0 +1,176 @@

				import os

				import tempfile

				import unittest

				from pathlib import Path

				from unittest.mock import MagicMock, patch

				import cli.lib.core.vllm.vllm_build as vllm_build

				_VLLM_BUILD_MODULE = "cli.lib.core.vllm.vllm_build"

				class TestVllmBuildParameters(unittest.TestCase):

				    @patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)

				    @patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=True)

				    @patch(

				        "cli.lib.common.envs_helper.env_path_optional",

				        side_effect=lambda name, default=None, resolve=True: {

				            "DOCKERFILE_PATH": Path("/abs/vllm/Dockerfile"),

				            "TORCH_WHEELS_PATH": Path("/abs/dist"),

				            "OUTPUT_DIR": Path("/abs/shared"),

				        }.get(name, Path(default) if default is not None else None),

				    )

				    @patch.dict(

				        os.environ,

				        {

				            "USE_TORCH_WHEEL": "1",

				            "USE_LOCAL_BASE_IMAGE": "1",

				            "USE_LOCAL_DOCKERFILE": "1",

				            "BASE_IMAGE": "my/image:tag",

				            "DOCKERFILE_PATH": "vllm/Dockerfile",

				            "TORCH_WHEELS_PATH": "dist",

				            "OUTPUT_DIR": "shared",

				        },

				        clear=True,

				    )

				    def test_params_success_normalizes_and_validates(

				        self, mock_env_path, mock_is_path, mock_local_img

				    ):

				        params = vllm_build.VllmBuildParameters()

				        self.assertEqual(params.torch_whls_path, Path("/abs/dist"))

				        self.assertEqual(params.dockerfile_path, Path("/abs/vllm/Dockerfile"))

				        self.assertEqual(params.output_dir, Path("/abs/shared"))

				        self.assertEqual(params.base_image, "my/image:tag")

				    @patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)

				    @patch.dict(

				        os.environ, {"USE_TORCH_WHEEL": "1", "TORCH_WHEELS_PATH": "dist"}, clear=True

				    )

				    def test_params_missing_torch_whls_raises(self, _is_path):

				        with tempfile.TemporaryDirectory() as td:

				            os.chdir(td)

				            with self.assertRaises(ValueError) as cm:

				                vllm_build.VllmBuildParameters(

				                    use_local_base_image=False,

				                    use_local_dockerfile=False,

				                )

				        err = cm.exception

				        self.assertIn("TORCH_WHEELS_PATH", str(err))

				    @patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=False)

				    @patch.dict(

				        os.environ, {"USE_LOCAL_BASE_IMAGE": "1", "BASE_IMAGE": "img:tag"}, clear=True

				    )

				    def test_params_missing_local_base_image_raises(self, _local_img):

				        with tempfile.TemporaryDirectory() as td:

				            os.chdir(td)

				            with self.assertRaises(ValueError) as cm:

				                vllm_build.VllmBuildParameters(

				                    use_torch_whl=False,

				                    use_local_dockerfile=False,

				                )

				        err = cm.exception

				        self.assertIn("BASE_IMAGE", str(err))

				    @patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)

				    @patch.dict(

				        os.environ,

				        {"USE_LOCAL_DOCKERFILE": "1", "DOCKERFILE_PATH": "Dockerfile"},

				        clear=True,

				    )

				    def test_params_missing_dockerfile_raises(self, _is_path):

				        with tempfile.TemporaryDirectory() as td:

				            os.chdir(td)

				            with self.assertRaises(ValueError) as cm:

				                vllm_build.VllmBuildParameters(

				                    use_torch_whl=False,

				                    use_local_base_image=False,

				                )

				        err = cm.exception

				        self.assertIn("DOCKERFILE_PATH", str(err))

				    @patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)

				    @patch.dict(

				        os.environ,

				        {"OUTPUT_DIR": ""},

				        clear=True,

				    )

				    def test_params_missing_output_dir(self, _is_path):

				        with self.assertRaises(FileNotFoundError):

				            vllm_build.VllmBuildParameters()

				class TestBuildCmdAndRun(unittest.TestCase):

				    @patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)

				    def test_generate_docker_build_cmd_includes_bits(self, _exists):

				        runner = vllm_build.VllmBuildRunner()

				        inputs = MagicMock()

				        inputs.output_dir = Path("/abs/out")

				        inputs.use_local_base_image = True

				        inputs.base_image = "img:tag"

				        inputs.torch_whls_path = Path("./vllm/tmp")

				        inputs.max_jobs = 64

				        inputs.cuda_version = "12.8.1"

				        inputs.python_version = "3.12"

				        inputs.sccache_bucket = "my-bucket"

				        inputs.sccache_region = "us-west-2"

				        inputs.torch_cuda_arch_list = "8.0;9.0"

				        inputs.target_stage = "export-wheels"

				        inputs.tag_name = "vllm-wheels"

				        cmd = runner._generate_docker_build_cmd(inputs)

				        squashed = " ".join(cmd.split())

				        self.assertIn("--output type=local,dest=/abs/out", squashed)

				        self.assertIn("-f docker/Dockerfile.nightly_torch", squashed)

				        self.assertIn("--pull=false", squashed)

				        self.assertIn("--build-arg TORCH_WHEELS_PATH=tmp", squashed)

				        self.assertIn("--build-arg BUILD_BASE_IMAGE=img:tag", squashed)

				        self.assertIn("--build-arg FINAL_BASE_IMAGE=img:tag", squashed)

				        self.assertIn("--build-arg max_jobs=64", squashed)

				        self.assertIn("--build-arg CUDA_VERSION=12.8.1", squashed)

				        self.assertIn("--build-arg PYTHON_VERSION=3.12", squashed)

				        self.assertIn("--build-arg USE_SCCACHE=1", squashed)

				        self.assertIn("--build-arg SCCACHE_BUCKET_NAME=my-bucket", squashed)

				        self.assertIn("--build-arg SCCACHE_REGION_NAME=us-west-2", squashed)

				        self.assertIn("--build-arg torch_cuda_arch_list='8.0;9.0'", squashed)

				        self.assertIn("--target export-wheels", squashed)

				        self.assertIn("-t vllm-wheels", squashed)

				    @patch(f"{_VLLM_BUILD_MODULE}.run_command")

				    @patch(f"{_VLLM_BUILD_MODULE}.ensure_dir_exists")

				    @patch(f"{_VLLM_BUILD_MODULE}.clone_vllm")

				    @patch.object(

				        vllm_build.VllmBuildRunner,

				        "_generate_docker_build_cmd",

				        return_value="docker buildx ...",

				    )

				    @patch.dict(

				        os.environ,

				        {

				            "USE_TORCH_WHEEL": "0",

				            "USE_LOCAL_BASE_IMAGE": "0",

				            "USE_LOCAL_DOCKERFILE": "0",

				            "OUTPUT_DIR": "shared",

				        },

				        clear=True,

				    )

				    def test_run_calls_clone_prepare_and_build(

				        self, mock_gen, mock_clone, mock_ensure, mock_run

				    ):

				        params = MagicMock()

				        params.output_dir = Path("shared")

				        params.use_local_dockerfile = False

				        params.use_torch_whl = False

				        with patch(f"{_VLLM_BUILD_MODULE}.VllmBuildParameters", return_value=params):

				            runner = vllm_build.VllmBuildRunner()

				            runner.run()

				        mock_clone.assert_called_once()

				        mock_ensure.assert_called_once_with(Path("shared"))

				        mock_gen.assert_called_once_with(params)

				        mock_run.assert_called_once()

				        _, kwargs = mock_run.call_args

				        assert kwargs.get("cwd") == "vllm"
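The stacked `@patch` decorators in this file depend on the argument order used by `unittest.mock`: decorators are applied bottom-up, so the lowest `@patch` supplies the first mock argument, and `@patch.dict` supplies none. A small self-contained sketch of that rule follows. It patches `os.listdir` and `os.getcwd` purely as illustrative targets, which are not what the vLLM tests patch.

```python
import os
from unittest.mock import patch


@patch("os.getcwd", return_value="/top")              # outermost -> last mock argument
@patch("os.listdir", return_value=["inner"])          # closest @patch -> first argument
@patch.dict(os.environ, {"SKETCH": "1"}, clear=True)  # patch.dict injects no argument
def check(mock_listdir, mock_getcwd):
    # Inside the call, both functions are mocks and the environment is replaced.
    return os.listdir("."), os.getcwd(), dict(os.environ)


listing, cwd, env = check()
```

This is why `test_run_calls_clone_prepare_and_build` lists `mock_gen` first and `mock_run` last, mirroring its decorators read from bottom to top.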

									
										7

.ci/magma/Makefile
									
				@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
				 	magma/build_magma.sh
				 .PHONY: all
				+all: magma-cuda130
				 all: magma-cuda129
				 all: magma-cuda128
				 all: magma-cuda126
				@ -25,6 +26,12 @@ clean:
				 	$(RM) -r magma-*
				 	$(RM) -r output
				+.PHONY: magma-cuda130
				+magma-cuda130: DESIRED_CUDA := 13.0
				+magma-cuda130: CUDA_ARCH_LIST := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
				+magma-cuda130:
				+	$(DOCKER_RUN)
				 .PHONY: magma-cuda129
				 magma-cuda129: DESIRED_CUDA := 12.9
				 magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120

									
										2

.ci/magma/build_magma.sh
									
				@ -28,6 +28,7 @@ pushd ${PACKAGE_DIR}/magma-${MAGMA_VERSION}
				 patch < ${PACKAGE_FILES}/CMake.patch
				 patch < ${PACKAGE_FILES}/cmakelists.patch
				 patch -p0 < ${PACKAGE_FILES}/thread_queue.patch
				+patch -p1 < ${PACKAGE_FILES}/cuda13.patch
				 patch -p1 < ${PACKAGE_FILES}/getrf_shfl.patch
				 patch -p1 < ${PACKAGE_FILES}/getrf_nbparam.patch
				 # The build.sh script expects to be executed from the sources root folder
				@ -37,6 +38,7 @@ popd
				 # Package recipe, license and tarball
				 # Folder and package name are backward compatible for the build workflow
				 cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
				+cp ${PACKAGE_FILES}/cuda13.patch ${PACKAGE_RECIPE}/cuda13.patch
				 cp ${PACKAGE_FILES}/thread_queue.patch ${PACKAGE_RECIPE}/thread_queue.patch
				 cp ${PACKAGE_FILES}/cmakelists.patch ${PACKAGE_RECIPE}/cmakelists.patch
				 cp ${PACKAGE_FILES}/getrf_shfl.patch ${PACKAGE_RECIPE}/getrf_shfl.patch

									
										26

.ci/magma/package_files/cuda13.patch
									
										Normal file
									
				@ -0,0 +1,26 @@

				diff --git a/interface_cuda/interface.cpp b/interface_cuda/interface.cpp

				index 73fed1b20..e77519bfe 100644

				--- a/interface_cuda/interface.cpp

				+++ b/interface_cuda/interface.cpp

				@@ -438,14 +438,20 @@ magma_print_environment()

				         cudaDeviceProp prop;

				         err = cudaGetDeviceProperties( &prop, dev );

				         check_error( err );

				+        #ifdef MAGMA_HAVE_CUDA

				+#if CUDA_VERSION < 13000

				         printf( "%% device %d: %s, %.1f MHz clock, %.1f MiB memory, capability %d.%d\n",

				                 dev,

				                 prop.name,

				                 prop.clockRate / 1000.,

				+#else

				+        printf( "%% device %d: %s, ??? MHz clock, %.1f MiB memory, capability %d.%d\n",

				+                dev,

				+                prop.name,

				+#endif

				                 prop.totalGlobalMem / (1024.*1024.),

				                 prop.major,

				                 prop.minor );

				-        #ifdef MAGMA_HAVE_CUDA

				         int arch = prop.major*100 + prop.minor*10;

				         if ( arch < MAGMA_CUDA_ARCH_MIN ) {

				             printf("\n"

									
										4

.ci/manywheel/build.sh
									
				@ -5,10 +5,6 @@ set -ex
				 SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
				 case "${GPU_ARCH_TYPE:-BLANK}" in
				-    BLANK)
				-        # Legacy behavior for CircleCI
				-        bash "${SCRIPTPATH}/build_cuda.sh"
				-        ;;
				     cuda)
				         bash "${SCRIPTPATH}/build_cuda.sh"
				         ;;

									
										33

.ci/manywheel/build_common.sh
									
				@ -138,28 +138,11 @@ fi

				echo "Calling setup.py bdist at $(date)"

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \

				time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				    EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				    echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				    time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				    BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \

				    BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				    USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				    CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				    echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				else

				    time CMAKE_ARGS=${CMAKE_ARGS[@]} \

				        EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \

				        BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \

				        USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \

				        python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR

				fi

				echo "Finished setup.py bdist at $(date)"

				# Build libtorch packages

				@@ -272,10 +255,6 @@ ls /tmp/$WHEELHOUSE_DIR

				mkdir -p "/$WHEELHOUSE_DIR"

				mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true

				fi

				if [[ -n "$BUILD_PYTHONLESS" ]]; then

				    mkdir -p /$LIBTORCH_HOUSE_DIR

				    mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR

				@@ -452,16 +431,8 @@ if [[ -z "$BUILD_PYTHONLESS" ]]; then

				  pushd $PYTORCH_ROOT/test

				  # Install the wheel for this Python version

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true

				  fi

				  pip uninstall -y "$TORCH_PACKAGE_NAME"

				  if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  fi

				  pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v

				  # Print info on the libraries installed in this wheel

									
										104

.ci/manywheel/build_cuda.sh
									
												
				@@ -66,6 +66,9 @@ case ${CUDA_VERSION} in

				            TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"

				        fi

				        ;;

				    13.0)

				        TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"

				        ;;

				    12.6)

				        TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"

				        ;;

				@@ -110,13 +113,18 @@ DEPS_SONAME=(

				)

				# CUDA_VERSION 12.6, 12.8, 12.9

				if [[ $CUDA_VERSION == 12* ]]; then

				# CUDA_VERSION 12.*, 13.*

				if [[ $CUDA_VERSION == 12* || $CUDA_VERSION == 13* ]]; then

				    export USE_STATIC_CUDNN=0

				    # Try parallelizing nvcc as well

				    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				    TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"

				    # Compress the fatbin with -compress-mode=size for CUDA 13

				    if [[ $CUDA_VERSION == 13* ]]; then

				        export TORCH_NVCC_FLAGS="$TORCH_NVCC_FLAGS -compress-mode=size"

				    fi

				    if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then

				        echo "Bundling with cudnn and cublas."

				        DEPS_LIST+=(

				            "/usr/local/cuda/lib64/libcudnn_adv.so.9"

				            "/usr/local/cuda/lib64/libcudnn_cnn.so.9"

				@@ -126,15 +134,11 @@ if [[ $CUDA_VERSION == 12* ]]; then

				            "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"

				            "/usr/local/cuda/lib64/libcudnn_heuristic.so.9"

				            "/usr/local/cuda/lib64/libcudnn.so.9"

				            "/usr/local/cuda/lib64/libcublas.so.12"

				            "/usr/local/cuda/lib64/libcublasLt.so.12"

				            "/usr/local/cuda/lib64/libcusparseLt.so.0"

				            "/usr/local/cuda/lib64/libcudart.so.12"

				            "/usr/local/cuda/lib64/libnvrtc.so.12"

				            "/usr/local/cuda/lib64/libnvrtc-builtins.so"

				            "/usr/local/cuda/lib64/libcufile.so.0"

				            "/usr/local/cuda/lib64/libcufile_rdma.so.1"

				            "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12"

				            "/usr/local/cuda/lib64/libnvshmem_host.so.3"

				            "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so"

				        )

				        DEPS_SONAME+=(

				@@ -146,41 +150,83 @@ if [[ $CUDA_VERSION == 12* ]]; then

				            "libcudnn_engines_precompiled.so.9"

				            "libcudnn_heuristic.so.9"

				            "libcudnn.so.9"

				            "libcublas.so.12"

				            "libcublasLt.so.12"

				            "libcusparseLt.so.0"

				            "libcudart.so.12"

				            "libnvrtc.so.12"

				            "libnvrtc-builtins.so"

				            "libnvshmem_host.so.3"

				            "libcufile.so.0"

				            "libcufile_rdma.so.1"

				            "libcupti.so.12"

				            "libnvperf_host.so"

				        )

				        # Add libnvToolsExt only if CUDA version is not 12.9

				        if [[ $CUDA_VERSION != 12.9* ]]; then

				            DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")

				            DEPS_SONAME+=("libnvToolsExt.so.1")

				        if [[ $CUDA_VERSION == 13* ]]; then

				            DEPS_LIST+=(

				                "/usr/local/cuda/lib64/libcublas.so.13"

				                "/usr/local/cuda/lib64/libcublasLt.so.13"

				                "/usr/local/cuda/lib64/libcudart.so.13"

				                "/usr/local/cuda/lib64/libnvrtc.so.13"

				                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13"

				                "/usr/local/cuda/lib64/libibverbs.so.1"

				                "/usr/local/cuda/lib64/librdmacm.so.1"

				                "/usr/local/cuda/lib64/libmlx5.so.1"

				                "/usr/local/cuda/lib64/libnl-3.so.200"

				                "/usr/local/cuda/lib64/libnl-route-3.so.200")

				            DEPS_SONAME+=(

				                "libcublas.so.13"

				                "libcublasLt.so.13"

				                "libcudart.so.13"

				                "libnvrtc.so.13"

				                "libcupti.so.13"

				                "libibverbs.so.1"

				                "librdmacm.so.1"

				                "libmlx5.so.1"

				                "libnl-3.so.200"

				                "libnl-route-3.so.200")

				            export USE_CUPTI_SO=1

				            export ATEN_STATIC_CUDA=0

				            export USE_CUDA_STATIC_LINK=0

				            export USE_CUFILE=0

				        else

				            DEPS_LIST+=(

				                "/usr/local/cuda/lib64/libnvToolsExt.so.1"

				                "/usr/local/cuda/lib64/libcublas.so.12"

				                "/usr/local/cuda/lib64/libcublasLt.so.12"

				                "/usr/local/cuda/lib64/libcudart.so.12"

				                "/usr/local/cuda/lib64/libnvrtc.so.12"

				                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12")

				            DEPS_SONAME+=(

				                "libnvToolsExt.so.1"

				                "libcublas.so.12"

				                "libcublasLt.so.12"

				                "libcudart.so.12"

				                "libnvrtc.so.12"

				                "libcupti.so.12")

				        fi

				    else

				        echo "Using nvidia libs from pypi."

				        CUDA_RPATHS=(

				            '$ORIGIN/../../nvidia/cublas/lib'

				            '$ORIGIN/../../nvidia/cuda_cupti/lib'

				            '$ORIGIN/../../nvidia/cuda_nvrtc/lib'

				            '$ORIGIN/../../nvidia/cuda_runtime/lib'

				            '$ORIGIN/../../nvidia/cudnn/lib'

				            '$ORIGIN/../../nvidia/cufft/lib'

				            '$ORIGIN/../../nvidia/curand/lib'

				            '$ORIGIN/../../nvidia/cusolver/lib'

				            '$ORIGIN/../../nvidia/cusparse/lib'

				            '$ORIGIN/../../nvidia/cusparselt/lib'

				            '$ORIGIN/../../cusparselt/lib'

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/nvshmem/lib'

				            '$ORIGIN/../../nvidia/nvtx/lib'

				            '$ORIGIN/../../nvidia/cufile/lib'

				            '$ORIGIN/../../nvidia/nccl/lib'

				            '$ORIGIN/../../nvidia/cusparselt/lib'

				        )

				        if [[ $CUDA_VERSION == 13* ]]; then

				            CUDA_RPATHS+=('$ORIGIN/../../nvidia/cu13/lib')

				        else

				            CUDA_RPATHS+=(

				                '$ORIGIN/../../nvidia/cublas/lib'

				                '$ORIGIN/../../nvidia/cuda_cupti/lib'

				                '$ORIGIN/../../nvidia/cuda_nvrtc/lib'

				                '$ORIGIN/../../nvidia/cuda_runtime/lib'

				                '$ORIGIN/../../nvidia/cufft/lib'

				                '$ORIGIN/../../nvidia/curand/lib'

				                '$ORIGIN/../../nvidia/cusolver/lib'

				                '$ORIGIN/../../nvidia/cusparse/lib'

				                '$ORIGIN/../../cusparselt/lib'

				                '$ORIGIN/../../nvidia/nvtx/lib'

				                '$ORIGIN/../../nvidia/cufile/lib'

				            )

				        fi

				        CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")

				        export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'

				        export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
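
The last three lines of this hunk collapse the `CUDA_RPATHS` array into a single colon-separated string before exporting it. A minimal sketch of that idiom follows, assuming a shortened, illustrative list of entries; the helper name `join_with_colon` is hypothetical and does not exist in `build_cuda.sh`. It uses positional parameters and `"$*"` so it also runs under plain `sh`; the script's `"${CUDA_RPATHS[*]}"` joins a bash array by the same rule (first character of `IFS`).

```shell
# Hypothetical helper: join all arguments with ':'.
# "$*" joins on the first character of IFS; the subshell keeps the
# IFS change from leaking into the caller.
join_with_colon() {
    (IFS=: ; printf '%s\n' "$*")
}

# Illustrative subset of RPATH entries. Single quotes keep $ORIGIN literal,
# so the dynamic loader (not the shell) expands it at load time.
joined=$(join_with_colon \
    '$ORIGIN/../../nvidia/cudnn/lib' \
    '$ORIGIN/../../nvidia/nccl/lib' \
    '$ORIGIN/../../nvidia/cu13/lib')

# Same concatenation shape as the C_SO_RPATH export in the hunk above.
C_SO_RPATH="${joined}"':$ORIGIN:$ORIGIN/lib'
printf '%s\n' "$C_SO_RPATH"
```

Running the join in a subshell is what lets the script change `IFS` for one expansion without affecting later word splitting.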

									
										2

.ci/manywheel/build_rocm.sh
									
												
				@@ -194,7 +194,7 @@ ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library

				ROCBLAS_LIB_DST=lib/rocblas/library

				ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)

				ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)

				ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)

				ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)

				# hipblaslt library files

				HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library

									
										1

.ci/manywheel/build_xpu.sh
									
												
				@@ -25,6 +25,7 @@ source /opt/intel/oneapi/mpi/latest/env/vars.sh

				export USE_STATIC_MKL=1

				export USE_ONEMKL=1

				export USE_XCCL=1

				export USE_MPI=0

				WHEELHOUSE_DIR="wheelhousexpu"

				LIBTORCH_HOUSE_DIR="libtorch_housexpu"

									
										60

.ci/pytorch/build.sh
									
												
				@@ -50,9 +50,6 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				  export ATEN_THREADING=NATIVE

				fi

				# Enable LLVM dependency for TensorExpr testing

				export USE_LLVM=/opt/llvm

				export LLVM_DIR=/opt/llvm/lib/cmake/llvm

				if ! which conda; then

				  # In ROCm CIs, we are doing cross compilation on build machines with

				@@ -95,6 +92,27 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then

				  export ACL_ROOT_DIR=/ComputeLibrary

				fi

				if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then

				  if [[ -f /opt/riscv-cross-env/bin/activate ]]; then

				    # shellcheck disable=SC1091

				    source /opt/riscv-cross-env/bin/activate

				  else

				    echo "Activation file not found"

				    exit 1

				  fi

				  export CMAKE_CROSSCOMPILING=TRUE

				  export CMAKE_SYSTEM_NAME=Linux

				  export CMAKE_SYSTEM_PROCESSOR=riscv64

				  export USE_CUDA=0

				  export USE_MKLDNN=0

				  export SLEEF_TARGET_EXEC_USE_QEMU=ON

				  sudo chown -R jenkins /var/lib/jenkins/workspace /opt

				fi

				if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then

				  POSSIBLE_JAVA_HOMES=()

				  POSSIBLE_JAVA_HOMES+=(/usr/local)

				@@ -155,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				  source /opt/intel/oneapi/mpi/latest/env/vars.sh

				  # Enable XCCL build

				  export USE_XCCL=1

				  export USE_MPI=0

				  # XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA

				  export USE_KINETO=0

				  export TORCH_XPU_ARCH_LIST=pvc

				@@ -176,8 +195,16 @@ fi

				# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of

				# memory to build and will OOM

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then

				  export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j 2"

				if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && echo "${TORCH_CUDA_ARCH_LIST}" | tr ' ' '\n' | sed 's/$/>= 8.0/' | bc | grep -q 1; then

				  J=2  # default to 2 jobs

				  case "$RUNNER" in

				    linux.12xlarge.memory|linux.24xlarge.memory)

				      J=24

				      ;;

				  esac

				  echo "Building FlashAttention with job limit $J"

				  export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j ${J}"

				fi

				if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then

				@@ -192,7 +219,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then

				  export USE_ASAN=1

				  export REL_WITH_DEB_INFO=1

				  export UBSAN_FLAGS="-fno-sanitize-recover=all"

				  unset USE_LLVM

				fi

				if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then

				@@ -213,7 +239,7 @@ fi

				# Do not change workspace permissions for ROCm and s390x CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then

				if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && -d /var/lib/jenkins/workspace ]]; then

				  # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)

				  WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")

				  cleanup_workspace() {

				@@ -258,29 +284,19 @@ else

				    # XLA test build fails when WERROR=1

				    # set only when building other architectures

				    # or building non-XLA tests.

				    if [[ "$BUILD_ENVIRONMENT" != *rocm*  &&

				          "$BUILD_ENVIRONMENT" != *xla* ]]; then

				    if [[ "$BUILD_ENVIRONMENT" != *rocm*  && "$BUILD_ENVIRONMENT" != *xla* && "$BUILD_ENVIRONMENT" != *riscv64* ]]; then

				      # Install numpy-2.0.2 for builds which are backward compatible with 1.X

				      python -mpip install numpy==2.0.2

				      WERROR=1 python setup.py clean

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        python3 tools/packaging/split_wheel.py bdist_wheel

				      else

				        WERROR=1 python setup.py bdist_wheel

				      fi

				      WERROR=1 python setup.py bdist_wheel

				    else

				      python setup.py clean

				      if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then

				        source .ci/pytorch/install_cache_xla.sh

				      fi

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        echo "USE_SPLIT_BUILD cannot be used with xla or rocm"

				        exit 1

				      else

				        python setup.py bdist_wheel

				      fi

				      python setup.py bdist_wheel

				    fi

				    pip_install_whl "$(echo dist/*.whl)"

				@@ -405,7 +421,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];

				  # don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build

				  python tools/stats/export_test_times.py

				fi

				# don't do this for bazel or s390x as they don't use sccache

				if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				# don't do this for bazel or s390x or riscv64 as they don't use sccache

				if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				  print_sccache_stats

				fi
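
The reworked FlashAttention guard in this file turns each space-separated entry of `TORCH_CUDA_ARCH_LIST` into a `>= 8.0` expression for `bc` and succeeds if any of them evaluates to 1. The sketch below shows the same "any entry at least 8.0" test, but swaps the `bc` pipeline for plain integer comparisons so it runs without external tools. The helper name `any_arch_at_least` is hypothetical, and the sketch assumes every entry has the form `major.minor` with no suffix.

```shell
# Hypothetical helper, not part of build.sh: succeed if any entry of a
# space-separated arch list is >= MIN_MAJOR.MIN_MINOR.
# Assumes each entry looks like "8.6" (major.minor, no "+PTX" suffix).
any_arch_at_least() {
    list=$1
    min_major=$2
    min_minor=$3
    for arch in $list; do            # unquoted on purpose: split on spaces
        major=${arch%%.*}            # text before the first '.'
        minor=${arch#*.}             # text after the first '.'
        if [ "$major" -gt "$min_major" ] ||
           { [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]; }; then
            return 0
        fi
    done
    return 1
}

if any_arch_at_least "7.5 8.6" 8 0; then
    echo "at least one arch is >= 8.0"
fi
```

Checking entries one at a time is the point of the change: the earlier condition passed the whole list to `bc` as a single expression, which is only meaningful when the list holds exactly one value.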

									
										23

.ci/pytorch/check_binary.sh
									
												
				@@ -67,7 +67,7 @@ fi

				#       wheels with cxx11-abi

				echo "Checking that the gcc ABI is what we expect"

				if [[ "$(uname)" != 'Darwin' ]]; then

				if [[ "$(uname)" != 'Darwin' &&  "$(uname -m)" != "s390x" ]]; then

				  # We also check that there are cxx11 symbols in libtorch

				  #

				  echo "Checking that symbols in libtorch.so have the right gcc abi"

				@@ -300,24 +300,3 @@ except RuntimeError as e:

				    exit 1

				  fi

				fi

				###############################################################################

				# Check for C++ ABI compatibility to GCC-11 - GCC 13

				###############################################################################

				if [[ "$(uname)" == 'Linux' &&  "$PACKAGE_TYPE" == 'manywheel' ]]; then

				  pushd /tmp

				  # Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html

				  # gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19

				  # gcc 11 - CUDA 11.8, xpu, rocm

				  # gcc 13 - CUDA 12.6, 12.8 and cpu

				  # Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426

				  if [[ "$(uname -m)" == "s390x" ]]; then

				    cxx_abi="19"

				  elif [[ "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then

				    cxx_abi="18"

				  else

				    cxx_abi="16"

				  fi

				  python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"

				  popd

				fi

									
										43

.ci/pytorch/common_utils.sh
									
												
				@@ -149,6 +149,19 @@ function get_pinned_commit() {

				  cat .github/ci_commit_pins/"${1}".txt

				}

				function detect_cuda_arch() {

				  if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then

				    if command -v nvidia-smi; then

				      TORCH_CUDA_ARCH_LIST=$(nvidia-smi --query-gpu=compute_cap --format=csv | tail -n 1)

				    elif [[ "${TEST_CONFIG}" == *nogpu* ]]; then

				      # There won't be nvidia-smi in nogpu tests, so just set TORCH_CUDA_ARCH_LIST to the default

				      # minimum supported value here

				      TORCH_CUDA_ARCH_LIST=8.0

				    fi

				    export TORCH_CUDA_ARCH_LIST

				  fi

				}

				function install_torchaudio() {

				  local commit

				  commit=$(get_pinned_commit audio)

				@@ -229,7 +242,6 @@ function install_torchrec_and_fbgemm() {

				    pip_install tabulate  # needed for newer fbgemm

				    pip_install patchelf  # needed for rocm fbgemm

				    pushd /tmp

				    local wheel_dir=dist/fbgemm_gpu

				    local found_whl=0

				@@ -245,7 +257,7 @@ function install_torchrec_and_fbgemm() {

				    if [ "${found_whl}" == "0" ]; then

				      git clone --recursive https://github.com/pytorch/fbgemm

				      pushd fbgemm/fbgemm_gpu

				      git checkout "${fbgemm_commit}"

				      git checkout "${fbgemm_commit}" --recurse-submodules

				      python setup.py bdist_wheel \

				        --build-variant=rocm \

				        -DHIP_ROOT_DIR="${ROCM_PATH}" \

				@@ -264,7 +276,6 @@ function install_torchrec_and_fbgemm() {

				    done

				    rm -rf fbgemm

				    popd

				  else

				    pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec

				    pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu

				@@ -273,7 +284,7 @@ function install_torchrec_and_fbgemm() {

				function clone_pytorch_xla() {

				  if [[ ! -d ./xla ]]; then

				    git clone --recursive --quiet https://github.com/pytorch/xla.git

				    git clone --recursive -b r2.9 https://github.com/pytorch/xla.git

				    pushd xla

				    # pin the xla hash so that we don't get broken by changes to xla

				    git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"

				@@ -283,30 +294,6 @@ function clone_pytorch_xla() {

				  fi

				}

				function checkout_install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  if [ "$1" ]; then

				    python install.py --continue_on_fail models "$@"

				  else

				    # Occasionally the installation may fail on one model but it is ok to continue

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  # TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488

				  # is regressing speedup metric. This needs to be investigated further

				  pip install transformers==4.38.1

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				}

				function install_torchao() {

				  local commit

				  commit=$(get_pinned_commit torchao)

									
										2

.ci/pytorch/cpp_doc_push_script.sh
									
												
				@@ -58,7 +58,7 @@ time python tools/setup_helpers/generate_code.py \

				# Build the docs

				pushd docs/cpp

				time make VERBOSE=1 html -j

				time make VERBOSE=1 html

				popd

				popd

									
										77

.ci/pytorch/macos-test.sh
									
												
				@@ -157,6 +157,34 @@ test_jit_hooks() {

				  assert_git_not_dirty

				}

				# Shellcheck doesn't like it when you pass no arguments to a function

				# that can take args. See https://www.shellcheck.net/wiki/SC2120

				# shellcheck disable=SC2120

				checkout_install_torchbench() {

				  local commit

				  commit=$(cat .ci/docker/ci_commit_pins/torchbench.txt)

				  git clone https://github.com/pytorch/benchmark torchbench

				  pushd torchbench

				  git checkout "$commit"

				  if [ "$1" ]; then

				    python install.py --continue_on_fail models "$@"

				  else

				    # Occasionally the installation may fail on one model but it is ok to continue

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  popd

				  pip install -r .ci/docker/ci_commit_pins/huggingface-requirements.txt

				  # https://github.com/pytorch/pytorch/issues/160689 to remove torchao because

				  # its current version 0.12.0 doesn't work with transformers 4.54.0

				  pip uninstall -y torchao

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				}

				torchbench_setup_macos() {

				  git clone --recursive https://github.com/pytorch/vision torchvision

				  git clone --recursive https://github.com/pytorch/audio torchaudio

				@@ -167,7 +195,7 @@ torchbench_setup_macos() {

				  git checkout "$(cat ../.github/ci_commit_pins/vision.txt)"

				  git submodule update --init --recursive

				  python setup.py clean

				  python setup.py develop

				  python -m pip install -e . -v --no-build-isolation

				  popd

				  pushd torchaudio

				@@ -176,11 +204,9 @@ torchbench_setup_macos() {

				  git submodule update --init --recursive

				  python setup.py clean

				  #TODO: Remove me, when figure out how to make TorchAudio find brew installed openmp

				  USE_OPENMP=0 python setup.py develop

				  USE_OPENMP=0 python -m pip install -e . -v --no-build-isolation

				  popd

				  # Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120

				  # shellcheck disable=SC2119,SC2120

				  checkout_install_torchbench

				}

				@@ -276,6 +302,47 @@ test_torchbench_smoketest() {

				    fi

				  done

				  echo "Pytorch benchmark on mps device completed"

				}

				test_aoti_torchbench_smoketest() {

				  print_cmake_info

				  echo "Launching AOTInductor torchbench setup"

				  pip_benchmark_deps

				  # shellcheck disable=SC2119,SC2120

				  torchbench_setup_macos

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  local device=mps

				  local dtypes=(undefined float16 bfloat16 notset)

				  local dtype=${dtypes[$1]}

				  local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)

				  echo "Launching torchbench inference performance run for AOT Inductor and dtype ${dtype}"

				  local dtype_arg="--${dtype}"

				  if [ "$dtype" == notset ]; then

				      dtype_arg="--float32"

				  fi

				  touch "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv"

				  for model in "${models[@]}"; do

				    PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \

				      --performance --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \

				      --output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv" || true

				    PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \

				      --accuracy --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \

				      --output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_accuracy.csv" || true

				  done

				  echo "Launching HuggingFace inference performance run for AOT Inductor and dtype ${dtype}"

				  PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \

				    --performance --export-aot-inductor --inference --devices "$device" "$dtype_arg" \

				    --output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_performance.csv" || true

				  PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \

				    --accuracy --export-aot-inductor --inference --devices "$device" "$dtype_arg" \

				    --output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_accuracy.csv" || true

				  echo "Pytorch benchmark on mps device completed"

				}

				@@ -324,6 +391,8 @@ elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then

				  test_timm_perf

				elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then

				  test_torchbench_smoketest "${SHARD_NUMBER}"

				elif [[ $TEST_CONFIG == *"aot_inductor_perf_smoketest"* ]]; then

				  test_aoti_torchbench_smoketest "${SHARD_NUMBER}"

				elif [[ $TEST_CONFIG == *"mps"* ]]; then

				  test_python_mps

				elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
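
`test_aoti_torchbench_smoketest` picks its dtype by indexing the four-entry `dtypes` array with the shard number, then maps the placeholder `notset` to `--float32`. A small sketch of that mapping, assuming nothing beyond what the function above shows: the helper name `dtype_arg_for_shard` is hypothetical, a `case` statement stands in for the bash array so the sketch also runs under plain `sh`, and the out-of-range branch is an addition for illustration (the original would simply expand to an empty string).

```shell
# Hypothetical helper mirroring the dtype lookup in
# test_aoti_torchbench_smoketest: shard index -> dtype -> CLI flag.
dtype_arg_for_shard() {
    case "$1" in
        0) dtype=undefined ;;
        1) dtype=float16 ;;
        2) dtype=bfloat16 ;;
        3) dtype=notset ;;
        *) echo "shard $1 is outside the four-entry table" >&2; return 1 ;;
    esac
    if [ "$dtype" = notset ]; then
        echo "--float32"        # "notset" falls back to the float32 flag
    else
        echo "--${dtype}"
    fi
}

for shard in 1 2 3; do
    echo "shard ${shard}: $(dtype_arg_for_shard "$shard")"
done
```

Note that the report file names in the function still use the raw `${dtype}` value, so the third shard writes to files named with `notset` while passing `--float32` to the benchmark.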

									
										1

.ci/pytorch/multigpu-test.sh
									
												
				@@ -45,6 +45,7 @@ if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then

				    # DTensor tests

				    time python test/run_test.py --verbose -i distributed/tensor/test_random_ops

				    time python test/run_test.py --verbose -i distributed/tensor/test_dtensor_compile

				    time python test/run_test.py --verbose -i distributed/tensor/test_utils.py

				    # DeviceMesh test

				    time python test/run_test.py --verbose -i distributed/test_device_mesh

									
										25

.ci/pytorch/numba-cuda-13.patch
									
										Normal file
									
												
				@@ -0,0 +1,25 @@

				From 6e08c9d08e9de59c7af28b720289debbbd384764 Mon Sep 17 00:00:00 2001

				From: Michael Wang <13521008+isVoid@users.noreply.github.com>

				Date: Tue, 1 Apr 2025 17:28:05 -0700

				Subject: [PATCH] Avoid bumping certain driver API to avoid future breakage

				 (#185)

				Co-authored-by: isVoid <isVoid@users.noreply.github.com>

				---

				 numba_cuda/numba/cuda/cudadrv/driver.py | 3 +++

				 1 file changed, 3 insertions(+)

				diff --git a/numba_cuda/numba/cuda/cudadrv/driver.py b/numba_cuda/numba/cuda/cudadrv/driver.py

				index 1641bf77..233e9ed7 100644

				--- a/numba_cuda/numba/cuda/cudadrv/driver.py

				+++ b/numba_cuda/numba/cuda/cudadrv/driver.py

				@@ -365,6 +365,9 @@ def _find_api(self, fname):

				         else:

				             variants = ('_v2', '')

				+        if fname in ("cuCtxGetDevice", "cuCtxSynchronize"):

				+            return getattr(self.lib, fname)

				+

				         for variant in variants:

				             try:

				                 return getattr(self.lib, f'{fname}{variant}')
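
`.ci/pytorch/test.sh` applies this patch with `patch -p4` after changing into the installed `numba/cuda` directory. The `-p4` option strips four leading path components from the file names in the patch header, so the sketch below works out what remains. The helper name `strip_components` is hypothetical and only mimics the path arithmetic; it does not invoke `patch`.

```shell
# Hypothetical helper: drop the first N '/'-separated components of a path,
# which is the effect `patch -pN` has on file names in a patch header.
strip_components() {
    path=$1
    n=$2
    while [ "$n" -gt 0 ]; do
        path=${path#*/}      # remove everything up to and including the first '/'
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

# File name as it appears in the "--- a/..." header of the patch above.
strip_components "a/numba_cuda/numba/cuda/cudadrv/driver.py" 4
# prints: cudadrv/driver.py
```

The result, `cudadrv/driver.py`, is relative to `numba/cuda`, which is why the script runs `patch` from `NUMBA_CUDA_DIR` rather than from the repository root.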

									
										23

.ci/pytorch/smoke_test/check_binary_symbols.py
									
												
				@@ -32,6 +32,9 @@ LIBTORCH_NAMESPACE_LIST = (

				    "torch::",

				)

				# Patterns for detecting statically linked libstdc++ symbols

				STATICALLY_LINKED_CXX11_ABI = [re.compile(r".*recursive_directory_iterator.*")]

				def _apply_libtorch_symbols(symbols):

				    return [

				@@ -53,12 +56,17 @@ def get_symbols(lib: str) -> list[tuple[str, str, str]]:

				    return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]

				def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:

				def grep_symbols(

				    lib: str, patterns: list[Any], symbol_type: str | None = None

				) -> list[str]:

				    def _grep_symbols(

				        symbols: list[tuple[str, str, str]], patterns: list[Any]

				    ) -> list[str]:

				        rc = []

				        for _s_addr, _s_type, s_name in symbols:

				            # Filter by symbol type if specified

				            if symbol_type and _s_type != symbol_type:

				                continue

				            for pattern in patterns:

				                if pattern.match(s_name):

				                    rc.append(s_name)

				@@ -80,6 +88,18 @@ def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:

				        return functools.reduce(list.__add__, (x.result() for x in tasks), [])

				def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:

				    cxx11_statically_linked_symbols = grep_symbols(

				        lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="T"

				    )

				    num_statically_linked_symbols = len(cxx11_statically_linked_symbols)

				    print(f"num_statically_linked_symbols (T): {num_statically_linked_symbols}")

				    if num_statically_linked_symbols > 0:

				        raise RuntimeError(

				            f"Found statically linked libstdc++ symbols (recursive_directory_iterator): {cxx11_statically_linked_symbols[:100]}"

				        )

				def check_lib_symbols_for_abi_correctness(lib: str) -> None:

				    print(f"lib: {lib}")

				    cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)

				@@ -107,6 +127,7 @@ def main() -> None:

				    libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")

				    check_lib_symbols_for_abi_correctness(libtorch_cpu_path)

				    check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)

				if __name__ == "__main__":
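
The new `symbol_type` argument makes `grep_symbols` skip any symbol whose type column differs from the requested one, so the static-linking check only counts symbols defined in the text section (`T`), not undefined references (`U`). The sketch below reproduces that filter in shell for consistency with the CI scripts in this diff. The helper name `filter_symbols` and the sample symbol lines are made up for illustration; only the `address type name` layout is taken from `get_symbols`.

```shell
# Hypothetical helper: read "address type name" lines on stdin and print the
# names whose type equals $1 and whose name contains the substring $2.
filter_symbols() {
    want_type=$1
    needle=$2
    while read -r _addr sym_type name; do
        # Filter by symbol type first, as the patched grep_symbols does.
        [ "$sym_type" = "$want_type" ] || continue
        case "$name" in
            *"$needle"*) printf '%s\n' "$name" ;;
        esac
    done
}

# Made-up sample input, for illustration only.
sample='0000000000001000 T std::filesystem::recursive_directory_iterator::pop()
0000000000000000 U std::filesystem::recursive_directory_iterator::increment()
0000000000002000 T at::Tensor::size(long) const'

hits=$(printf '%s\n' "$sample" | filter_symbols T recursive_directory_iterator)
printf '%s\n' "$hits"
```

Only the first sample line survives: the second matches the name but has type `U`, and the third has type `T` but an unrelated name.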

									
										94

.ci/pytorch/test.sh
									
												
				@@ -32,6 +32,16 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v

				  git config --global --add safe.directory /var/lib/jenkins/workspace

				fi

				# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878

				NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)

				if [ -n "$NUMBA_CUDA_DIR" ]; then

				  NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"

				  pushd "$NUMBA_CUDA_DIR"

				  patch -p4 <"$NUMBA_PATCH"

				  popd

				fi

				echo "Environment variables:"

				env

				@@ -91,6 +101,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then

				  export VALGRIND=OFF

				fi

				detect_cuda_arch

				if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then

				  # There are additional warnings on s390x, maybe due to newer gcc.

				@@ -495,6 +506,14 @@ test_inductor_cpp_wrapper_shard() {

				    -k 'take' \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				  if [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then

				    python test/run_test.py \

				      --include inductor/test_mkldnn_pattern_matcher \

				      -k 'xpu' \

				      --shard "$1" "$NUM_TEST_SHARDS" \

				      --verbose

				  fi

				}

				# "Global" flags for inductor benchmarking controlled by TEST_CONFIG

				@@ -627,6 +646,8 @@ test_perf_for_dashboard() {

				    device=cuda_a10g

				  elif [[ "${TEST_CONFIG}" == *h100* ]]; then

				    device=cuda_h100

				  elif [[ "${TEST_CONFIG}" == *b200* ]]; then

				    device=cuda_b200

				  elif [[ "${TEST_CONFIG}" == *rocm* ]]; then

				    device=rocm

				  fi

				@ -801,6 +822,16 @@ test_dynamo_benchmark() {

				  if [[ "${TEST_CONFIG}" == *perf_compare* ]]; then

				    test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"

				  elif [[ "${TEST_CONFIG}" == *perf* ]]; then

				    # TODO (huydhn): Just smoke test some sample models

				    if [[ "${TEST_CONFIG}" == *b200* ]]; then

				      if [[ "${suite}" == "huggingface" ]]; then

				        export TORCHBENCH_ONLY_MODELS="DistillGPT2"

				      elif [[ "${suite}" == "timm_models" ]]; then

				        export TORCHBENCH_ONLY_MODELS="inception_v3"

				      elif [[ "${suite}" == "torchbench" ]]; then

				        export TORCHBENCH_ONLY_MODELS="hf_Bert"

				      fi

				    fi

				    test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"

				  else

				    if [[ "${TEST_CONFIG}" == *cpu* ]]; then

				@ -1039,20 +1070,10 @@ test_libtorch_api() {

				    mkdir -p $TEST_REPORTS_DIR

				    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml

				    "$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml

				  else

				    # Exclude IMethodTest that relies on torch::deploy, which will instead be run in test_deploy

				    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"

				    # On s390x, pytorch is built without llvm.

				    # Even if it would be built with llvm, llvm currently doesn't support used features on s390x and

				    # test fails with errors like:

				    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer

				    # unknown file: Failure

				    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }

				    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then

				      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr

				    fi

				  fi

				  # quantization is not fully supported on s390x yet

				@ -1603,6 +1624,25 @@ test_operator_benchmark() {

				      --expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"

				}

				test_operator_microbenchmark() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  TEST_DIR=$(pwd)

				  cd benchmarks/operator_benchmark/pt_extension

				  python -m pip install .

				  cd "${TEST_DIR}"/benchmarks/operator_benchmark

				  for OP_BENCHMARK_TESTS in matmul mm addmm bmm; do

				    $TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \

				      --output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}_compile.json" \

				      --benchmark-name "PyTorch operator microbenchmark" --use-compile

				    $TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \

				      --output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}.json" \

				      --benchmark-name "PyTorch operator microbenchmark"

				  done

				}

				if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

				  (cd test && python -c "import torch; print(torch.__config__.show())")

				@ -1627,6 +1667,10 @@ elif [[ "${TEST_CONFIG}" == *xla* ]]; then

				  install_torchvision

				  build_xla

				  test_xla

				elif [[ "$TEST_CONFIG" == *vllm* ]]; then

				    echo "vLLM CI uses TORCH_CUDA_ARCH_LIST: $TORCH_CUDA_ARCH_LIST"

				    (cd .ci/lumen_cli && python -m pip install -e .)

				    python -m cli.run test external vllm --test-plan "$TEST_CONFIG" --shard-id "$SHARD_NUMBER" --num-shards "$NUM_TEST_SHARDS"

				elif [[ "${TEST_CONFIG}" == *executorch* ]]; then

				  test_executorch

				elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then

				@ -1653,6 +1697,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then

				    test_operator_benchmark cpu ${TEST_MODE}

				  fi

				elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then

				  test_operator_microbenchmark

				elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

				@ -1672,54 +1718,40 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then

				elif [[ "${TEST_CONFIG}" == cachebench ]]; then

				  install_torchaudio

				  install_torchvision

				  checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco

				  PYTHONPATH=$(pwd)/torchbench test_cachebench

				  PYTHONPATH=/torchbench test_cachebench

				elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then

				  install_torchaudio

				  install_torchvision

				  checkout_install_torchbench nanogpt

				  PYTHONPATH=$(pwd)/torchbench test_verify_cachebench

				  PYTHONPATH=/torchbench test_verify_cachebench

				elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  install_torchaudio

				  install_torchvision

				  install_torchao

				  id=$((SHARD_NUMBER-1))

				  # https://github.com/opencv/opencv-python/issues/885

				  pip_install opencv-python==4.8.0.74

				  if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then

				    checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf

				    PYTHONPATH=/torchbench test_inductor_torchbench_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then

				    checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \

				      llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \

				      functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0

				    PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf

				    PYTHONPATH=/torchbench test_inductor_torchbench_cpu_smoketest_perf

				  elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then

				    checkout_install_torchbench

				    TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest

				    TORCHBENCHPATH=/torchbench test_torchbench_gcp_smoketest

				  else

				    checkout_install_torchbench

				    # Do this after checkout_install_torchbench to ensure we clobber any

				    # nightlies that torchbench may pull in

				    if [[ "${TEST_CONFIG}" != *cpu* ]]; then

				      install_torchrec_and_fbgemm

				    fi

				    PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"

				    PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"

				  fi

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then

				  install_torchvision

				  PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				  PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"

				  if [[ "$SHARD_NUMBER" -eq "1" ]]; then

				    test_inductor_aoti

				  fi

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then

				      test_inductor_distributed

				    fi

				  fi

				elif [[ "${TEST_CONFIG}" == *einops* ]]; then

				  test_einops

				elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
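The B200 smoke-test override in the `test_dynamo_benchmark` hunk of `.ci/pytorch/test.sh` can be read as a small lookup. The following is a minimal Python sketch of that branch logic only, not code from the repository: the function name `b200_smoke_model` and the sample `TEST_CONFIG` strings in the usage note are hypothetical, and the shell `*pattern*` globs are modelled as plain substring checks.

```python
from typing import Optional

# Suite -> single model exported via TORCHBENCH_ONLY_MODELS in the hunk above.
B200_SMOKE_MODELS = {
    "huggingface": "DistillGPT2",
    "timm_models": "inception_v3",
    "torchbench": "hf_Bert",
}


def b200_smoke_model(test_config: str, suite: str) -> Optional[str]:
    """Return the model the B200 override would select, or None if it does not apply.

    Hypothetical helper: it mirrors the if/elif order of the shell script.
    """
    # The script tests *perf_compare* first, so that branch never reaches the override.
    if "perf_compare" in test_config:
        return None
    # Only the *perf* branch on a *b200* config narrows the model list.
    if "perf" in test_config and "b200" in test_config:
        return B200_SMOKE_MODELS.get(suite)
    return None
```

For example, under these assumptions `b200_smoke_model("perf_b200", "torchbench")` gives `"hf_Bert"`, while any `perf_compare` config gives `None`.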

									
										9

.ci/pytorch/win-test-helpers/build_pytorch.bat
									
												
				@ -61,9 +61,10 @@ if "%USE_XPU%"=="1" (

				  call "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\env\vars.bat"

				  call "C:\Program Files (x86)\Intel\oneAPI\ocloc\latest\env\vars.bat"

				  if errorlevel 1 exit /b 1

				  :: Reduce build time. Only have MTL self-hosted runner now

				  SET TORCH_XPU_ARCH_LIST=xe-lpg

				  SET USE_KINETO=0

				  :: Reduce build time

				  SET TORCH_XPU_ARCH_LIST=bmg

				  :: Re-setup python env for build

				  call pip install -r requirements.txt

				)

				@echo on

				@ -136,7 +137,7 @@ sccache --show-stats

				python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"

				(

				  if "%BUILD_ENVIRONMENT%"=="" (

				    echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.

				    echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%\envs\py_tmp` in Command Prompt before running Git Bash.

				  ) else (

				    copy /Y "dist\*.whl" "%PYTORCH_FINAL_PACKAGE_DIR%"

									
										12

.ci/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat
									
												
				@ -3,12 +3,12 @@ if "%BUILD_ENVIRONMENT%"=="" (

				) else (

				  set CONDA_PARENT_DIR=C:\Jenkins

				)

				set CONDA_ROOT_DIR=%CONDA_PARENT_DIR%\Miniconda3

				:: Be conservative here when rolling out the new AMI with conda. This will try

				:: to install conda as before if it couldn't find the conda installation. This

				:: can be removed eventually after we gain enough confidence in the AMI

				if not exist %CONDA_PARENT_DIR%\Miniconda3 (

				if not exist %CONDA_ROOT_DIR% (

				  set INSTALL_FRESH_CONDA=1

				)

				@ -17,10 +17,14 @@ if "%INSTALL_FRESH_CONDA%"=="1" (

				  if errorlevel 1 exit /b

				  if not errorlevel 0 exit /b

				  %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3

				  %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_ROOT_DIR%

				  if errorlevel 1 exit /b

				  if not errorlevel 0 exit /b

				)

				:: Activate conda so that we can use its commands, i.e. conda, python, pip

				call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3

				call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%

				:: Activate conda so that we can use its commands, i.e. conda, python, pip

				call conda activate py_tmp

				call pip install -r .ci/docker/requirements-ci.txt

									
										2

.ci/pytorch/win-test-helpers/setup_pytorch_env.bat
									
												
				@ -14,7 +14,7 @@ if not errorlevel 0 exit /b

				:: build\torch. Rather than changing all these references, making a copy of torch folder

				:: from conda to the current workspace is easier. The workspace will be cleaned up after

				:: the job anyway

				xcopy /s %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\

				xcopy /s %CONDA_ROOT_DIR%\envs\py_tmp\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\

				pushd .

				if "%VC_VERSION%" == "" (

									
										14

.ci/pytorch/win-test.sh
									
												
				@ -38,13 +38,20 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then

				fi

				# TODO: Move both of them to Windows AMI

				python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1

				python -m pip install tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1

				# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments

				# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'

				# scipy from 1.6.3 to 1.10

				# expecttest from 0.1.3 to 0.3.0

				# xdoctest from 1.0.2 to 1.3.0

				python -m pip install "future==0.18.2" "hypothesis==5.35.1" "expecttest==0.3.0" "librosa>=0.6.2" "scipy==1.10.1" "psutil==5.9.1" "pynvml==11.4.1" "pillow==9.2.0" "unittest-xml-reporting<=3.2.0,>=2.0.0" "pytest==7.1.3" "pytest-xdist==2.5.0" "pytest-flakefinder==1.1.0" "pytest-rerunfailures==10.3" "pytest-shard==0.1.2" "sympy==1.11.1" "xdoctest==1.3.0" "pygments==2.12.0" "opt-einsum>=3.3" "networkx==2.8.8" "mpmath==1.2.1" "pytest-cpp==2.3.0" "boto3==1.35.42"

				# Install Z3 optional dependency for Windows builds.

				python -m pip install z3-solver==4.15.1.0

				# Install tlparse for test\dynamo\test_structured_trace.py UTs.

				python -m pip install tlparse==0.3.30

				python -m pip install tlparse==0.4.0

				# Install parameterized

				python -m pip install parameterized==0.8.1

				@ -52,9 +59,6 @@ python -m pip install parameterized==0.8.1

				# Install pulp for testing ilps under torch\distributed\_tools

				python -m pip install pulp==2.9.0

				# Install expecttest to merge https://github.com/pytorch/pytorch/pull/155308

				python -m pip install expecttest==0.3.0

				run_tests() {

				    # Run nvidia-smi if available

				    for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

									
										2

.ci/pytorch/windows/cuda126.bat
									
												
				@ -37,7 +37,7 @@ IF "%CUDA_PATH_V126%"=="" (

				)

				IF "%BUILD_VISION%" == "" (

				    set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0

				    set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0

				    set TORCH_NVCC_FLAGS=-Xfatbin -compress-all

				) ELSE (

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90

									
										4

.ci/pytorch/windows/cuda128.bat
									
												
				@ -37,10 +37,10 @@ IF "%CUDA_PATH_V128%"=="" (

				)

				IF "%BUILD_VISION%" == "" (

				    set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0

				    set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0

				    set TORCH_NVCC_FLAGS=-Xfatbin -compress-all

				) ELSE (

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120

				)

				set "CUDA_PATH=%CUDA_PATH_V128%"

									
										59

.ci/pytorch/windows/cuda130.bat
									
										Normal file
									
												
				@ -0,0 +1,59 @@

				@echo off

				set MODULE_NAME=pytorch

				IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (

				    call internal\clone.bat

				    cd %~dp0

				) ELSE (

				    call internal\clean.bat

				)

				IF ERRORLEVEL 1 goto :eof

				call internal\check_deps.bat

				IF ERRORLEVEL 1 goto :eof

				REM Check for optional components

				set USE_CUDA=

				set CMAKE_GENERATOR=Visual Studio 15 2017 Win64

				IF "%NVTOOLSEXT_PATH%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib"  (

				        set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt

				    ) ELSE (

				        echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing

				        exit /b 1

				    )

				)

				IF "%CUDA_PATH_V130%"=="" (

				    IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\bin\nvcc.exe" (

				        set "CUDA_PATH_V130=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0"

				    ) ELSE (

				        echo CUDA 13.0 not found, failing

				        exit /b 1

				    )

				)

				IF "%BUILD_VISION%" == "" (

				    set TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0

				    set TORCH_NVCC_FLAGS=-Xfatbin -compress-all

				) ELSE (

				    set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120

				)

				set "CUDA_PATH=%CUDA_PATH_V130%"

				set "PATH=%CUDA_PATH_V130%\bin;%PATH%"

				:optcheck

				call internal\check_opts.bat

				IF ERRORLEVEL 1 goto :eof

				if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..

				call  %~dp0\internal\copy.bat

				IF ERRORLEVEL 1 goto :eof

				call  %~dp0\internal\setup.bat

				IF ERRORLEVEL 1 goto :eof

									
										27

.ci/pytorch/windows/internal/copy.bat
									
												
				@ -1,12 +1,20 @@

				copy "%CUDA_PATH%\bin\cusparse*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\cublas*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\cudart*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\curand*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\cufft*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib

				if %CUDA_VERSION% geq 130 (

				    set "dll_path=bin\x64"

				) else (

				    set "dll_path=bin"

				)

				copy "%CUDA_PATH%\%dll_path%\cusparse*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\cublas*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\cudart*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\curand*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\cufft*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\cusolver*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\nvrtc*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\%dll_path%\nvJitLink_*.dll*"  pytorch\torch\lib

				copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib

				copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib

				@ -20,8 +28,3 @@ copy "%libuv_ROOT%\bin\uv.dll" pytorch\torch\lib

				if exist "C:\Windows\System32\zlibwapi.dll" (

				    copy "C:\Windows\System32\zlibwapi.dll"  pytorch\torch\lib

				)

				:: Copying the nvJitLink DLL is required for CUDA 12+

				if exist "%CUDA_PATH%\bin\nvJitLink_*.dll*" (

				    copy "%CUDA_PATH%\bin\nvJitLink_*.dll*"  pytorch\torch\lib

				)
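The `copy.bat` hunk above switches the CUDA DLL source directory on `CUDA_VERSION`: `bin\x64` for 130 and later, `bin` otherwise. Below is a minimal Python sketch of that selection, assuming the version is passed as an integer such as `128` or `130`. The helper name `cuda_dll_dir` and the sample toolkit paths are hypothetical and do not come from the repository.

```python
from pathlib import PureWindowsPath


def cuda_dll_dir(cuda_path: str, cuda_version: int) -> PureWindowsPath:
    """Return the directory the toolkit DLLs are copied from.

    Hypothetical helper mirroring `if %CUDA_VERSION% geq 130` in copy.bat.
    """
    # CUDA 13.0 and later: DLLs live under bin\x64; earlier versions use bin.
    sub_dir = PureWindowsPath("bin", "x64") if cuda_version >= 130 else PureWindowsPath("bin")
    return PureWindowsPath(cuda_path) / sub_dir


# Sample paths are illustrative only.
print(cuda_dll_dir(r"C:\CUDA\v13.0", 130))
print(cuda_dll_dir(r"C:\CUDA\v12.8", 128))
```

`PureWindowsPath` is used so the sketch produces Windows-style paths on any host platform.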

									
										28

.ci/pytorch/windows/internal/cuda_install.bat
									
												
				@ -26,6 +26,7 @@ if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%

				if %CUDA_VER% EQU 126 goto cuda126

				if %CUDA_VER% EQU 128 goto cuda128

				if %CUDA_VER% EQU 129 goto cuda129

				if %CUDA_VER% EQU 130 goto cuda130

				echo CUDA %CUDA_VERSION_STR% is not supported

				exit /b 1

				@ -113,6 +114,33 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda130

				set CUDA_INSTALL_EXE=cuda_13.0.0_windows.exe

				if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (

				    curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" & REM @lint-ignore

				    if errorlevel 1 exit /b 1

				    set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"

				    set "ARGS="

				)

				set CUDNN_FOLDER=cudnn-windows-x86_64-9.12.0.46_cuda13-archive

				set CUDNN_LIB_FOLDER="lib"

				set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"

				if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (

				    curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" & REM @lint-ignore

				    if errorlevel 1 exit /b 1

				    set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"

				)

				@REM cuDNN 8.3+ required zlib to be installed on the path

				echo Installing ZLIB dlls

				curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"

				7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"

				xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"

				goto cuda_common

				:cuda_common

				:: NOTE: We only install CUDA if we don't have it installed already.

				:: With GHA runners these should be pre-installed as part of our AMI process

									
										10

.ci/pytorch/windows/internal/driver_update.bat
									
												
				@ -1,9 +1,9 @@

				set WIN_DRIVER_VN=528.89

				set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe" & REM @lint-ignore

				curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe

				set WIN_DRIVER_VN=580.88

				set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe" & REM @lint-ignore

				curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe

				if errorlevel 1 exit /b 1

				start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe -s -noreboot

				start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe -s -noreboot

				if errorlevel 1 exit /b 1

				del %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe || ver > NUL

				del %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe || ver > NUL

									
										12

.ci/pytorch/windows/internal/install_python.bat
									
												
				@ -1,12 +1,22 @@

				set ADDITIONAL_OPTIONS=""

				set PYTHON_EXEC="python"

				if "%DESIRED_PYTHON%" == "3.13t" (

				    echo Python version is set to 3.13t

				    set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"

				    set ADDITIONAL_OPTIONS="Include_freethreaded=1"

				    set PYTHON_EXEC="python3.13t"

				) else if "%DESIRED_PYTHON%"=="3.14" (

				    echo Python version is set to 3.14 or 3.14t

				    set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.14.0/python-3.14.0rc1-amd64.exe"

				) else if "%DESIRED_PYTHON%"=="3.14t" (

				    echo Python version is set to 3.14 or 3.14t

				    set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.14.0/python-3.14.0rc1-amd64.exe"

				    set ADDITIONAL_OPTIONS="Include_freethreaded=1"

				    set PYTHON_EXEC="python3.14t"

				) else (

				    echo DESIRED_PYTHON not defined, Python version is set to %DESIRED_PYTHON%

				    echo Python version is set to %DESIRED_PYTHON%

				    set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/%DESIRED_PYTHON%.0/python-%DESIRED_PYTHON%.0-amd64.exe" %= @lint-ignore =%

				)
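The branch structure of `install_python.bat` after this change can be summarised as a mapping from `DESIRED_PYTHON` to an installer URL, extra installer options, and the interpreter name. The sketch below is an illustration under stated assumptions, not the script itself: `PythonInstall` and `resolve_python_install` are hypothetical names, and the batch file's literal `""` defaults are simplified to empty strings.

```python
from typing import NamedTuple

BASE_URL = "https://www.python.org/ftp/python"


class PythonInstall(NamedTuple):
    url: str
    additional_options: str
    python_exec: str


def resolve_python_install(desired: str) -> PythonInstall:
    """Hypothetical mirror of the if/else chain in install_python.bat."""
    if desired == "3.13t":
        # Free-threaded 3.13 uses the regular 3.13.0 installer plus an option.
        return PythonInstall(
            f"{BASE_URL}/3.13.0/python-3.13.0-amd64.exe",
            "Include_freethreaded=1",
            "python3.13t",
        )
    if desired in ("3.14", "3.14t"):
        # Both 3.14 variants download the same release-candidate installer.
        url = f"{BASE_URL}/3.14.0/python-3.14.0rc1-amd64.exe"
        if desired == "3.14t":
            return PythonInstall(url, "Include_freethreaded=1", "python3.14t")
        return PythonInstall(url, "", "python")
    # Any other version is assumed to have a matching X.Y.0 installer.
    return PythonInstall(
        f"{BASE_URL}/{desired}.0/python-{desired}.0-amd64.exe", "", "python"
    )
```

The fallback branch performs no validation, which matches the batch script: an unknown `DESIRED_PYTHON` simply yields a URL that may not exist.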

									
										21

.ci/pytorch/windows/internal/xpu_install.bat
									
												
				@ -13,9 +13,9 @@ if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"

				:xpu_bundle_install_start

				set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI

				set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d6d6c17-ca2d-4735-9331-99447e4a1280/intel-deep-learning-essentials-2025.0.1.28_offline.exe

				set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe

				set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product

				set XPU_BUNDLE_VERSION=2025.0.1+20

				set XPU_BUNDLE_VERSION=2025.1.3+5

				set XPU_BUNDLE_INSTALLED=0

				set XPU_BUNDLE_UNINSTALL=0

				set XPU_EXTRA_URL=NULL

				@ -24,9 +24,9 @@ set XPU_EXTRA_VERSION=2025.0.1+1226

				set XPU_EXTRA_INSTALLED=0

				set XPU_EXTRA_UNINSTALL=0

				if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.1] (

				    set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe

				    set XPU_BUNDLE_VERSION=2025.1.3+5

				if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.2] (

				    set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/24751ead-ddc5-4479-b9e6-f9fe2ff8b9f2/intel-deep-learning-essentials-2025.2.1.25_offline.exe

				    set XPU_BUNDLE_VERSION=2025.2.1+20

				)

				:: Check if XPU bundle is target version or already installed

				@ -90,14 +90,3 @@ if errorlevel 1 exit /b 1

				del xpu_extra.exe

				:xpu_install_end

				if not "%XPU_ENABLE_KINETO%"=="1" goto install_end

				:: Install Level Zero SDK

				set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

				curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				echo "Installing level zero SDK..."

				7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"

				set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"

				del "%SRC_DIR%\temp_build\level_zero_sdk.zip"

				:install_end

									
										2

.ci/pytorch/windows/setup_build.bat
									
												
				@ -7,6 +7,8 @@ call "internal\install_python.bat"

				%PYTHON_EXEC% --version

				set "PATH=%CD%\Python\Lib\site-packages\cmake\data\bin;%CD%\Python\Scripts;%CD%\Python;%PATH%"

				if "%DESIRED_PYTHON%" == "3.14t" %PYTHON_EXEC% -m pip install numpy==2.3.2 cmake

				if "%DESIRED_PYTHON%" == "3.14" %PYTHON_EXEC% -m pip install numpy==2.3.2 cmake

				if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install numpy==2.2.1 cmake

				if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install numpy==2.1.2 cmake

				if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake

									
										70

.ci/wheel/build_wheel.sh
									
												
				@ -124,20 +124,31 @@ popd

				export TH_BINARY_BUILD=1

				export INSTALL_TEST=0 # don't install test binaries into site-packages

				export MACOSX_DEPLOYMENT_TARGET=10.15

				export MACOSX_DEPLOYMENT_TARGET=11.0

				export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

				SETUPTOOLS_PINNED_VERSION="==70.1.0"

				PYYAML_PINNED_VERSION="=5.3"

				EXTRA_CONDA_INSTALL_FLAGS=""

				CONDA_ENV_CREATE_FLAGS=""

				RENAME_WHEEL=true

				case $desired_python in

				    3.14t)

				        echo "Using 3.14t deps"

				        NUMPY_PINNED_VERSION="==2.1.0"

				        CONDA_ENV_CREATE_FLAGS="python-freethreading"

				        EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"

				        desired_python="3.14.0rc1"

				        RENAME_WHEEL=false

				        ;;

				    3.14)

				        echo "Using 3.14 deps"

				        NUMPY_PINNED_VERSION="==2.1.0"

				        EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"

				        desired_python="3.14.0rc1"

				        RENAME_WHEEL=false

				        ;;

				    3.13t)

				        echo "Using 3.13 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=6.0.1"

				        NUMPY_PINNED_VERSION="=2.1.0"

				        NUMPY_PINNED_VERSION="==2.1.0"

				        CONDA_ENV_CREATE_FLAGS="python-freethreading"

				        EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"

				        desired_python="3.13"

				@ -145,37 +156,23 @@ case $desired_python in

				        ;;

				    3.13)

				        echo "Using 3.13 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=6.0.1"

				        NUMPY_PINNED_VERSION="=2.1.0"

				        NUMPY_PINNED_VERSION="==2.1.0"

				        ;;

				    3.12)

				        echo "Using 3.12 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=6.0.1"

				        NUMPY_PINNED_VERSION="=2.0.2"

				        NUMPY_PINNED_VERSION="==2.0.2"

				        ;;

				    3.11)

				        echo "Using 3.11 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=5.3"

				        NUMPY_PINNED_VERSION="=2.0.2"

				        NUMPY_PINNED_VERSION="==2.0.2"

				        ;;

				    3.10)

				        echo "Using 3.10 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=5.3"

				        NUMPY_PINNED_VERSION="=2.0.2"

				        ;;

				    3.9)

				        echo "Using 3.9 deps"

				        SETUPTOOLS_PINNED_VERSION=">=70.1.0"

				        PYYAML_PINNED_VERSION=">=5.3"

				        NUMPY_PINNED_VERSION="=2.0.2"

				        NUMPY_PINNED_VERSION="==2.0.2"

				        ;;

				    *)

				        echo "Using default deps"

				        NUMPY_PINNED_VERSION="=1.11.3"

				        echo "Unsupported version $desired_python"

				        exit 1

				        ;;

				esac

				@ -184,17 +181,17 @@ tmp_env_name="wheel_py$python_nodot"

				conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}

				source activate "$tmp_env_name"

				retry pip install -r "${pytorch_rootdir}/requirements-build.txt"

				pip install "numpy=${NUMPY_PINNED_VERSION}"  "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing-extensions

				PINNED_PACKAGES=(

				    "numpy${NUMPY_PINNED_VERSION}"

				)

				retry pip install "${PINNED_PACKAGES[@]}" -r "${pytorch_rootdir}/requirements-build.txt"

				pip install requests ninja typing-extensions

				retry pip install -r "${pytorch_rootdir}/requirements.txt" || true

				retry brew install libomp

				# For USE_DISTRIBUTED=1 on macOS, we need libuv, which is built as part of the tensorpipe submodule

				export USE_DISTRIBUTED=1

				if [[ -n "$CROSS_COMPILE_ARM64" ]]; then

				    export CMAKE_OSX_ARCHITECTURES=arm64

				fi

				export USE_MKLDNN=OFF

				export USE_QNNPACK=OFF

				export BUILD_TEST=OFF

				@ -202,16 +199,7 @@ export BUILD_TEST=OFF

				pushd "$pytorch_rootdir"

				echo "Calling setup.py bdist_wheel at $(date)"

				if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				    echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"

				    echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"

				    echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				    BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 CMAKE_FRESH=1 python setup.py bdist_wheel -d "$whl_tmp_dir"

				    echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"

				else

				    python setup.py bdist_wheel -d "$whl_tmp_dir"

				fi

				python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name ${mac_version}

				echo "Finished setup.py bdist_wheel at $(date)"

									
12 .circleci/scripts/binary_linux_test.sh
												
				@@ -65,16 +65,8 @@ fi

				if [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then

				    if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				      pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"

				      pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"

				      # todo: after folder is populated use the pypi_pkg channel instead

				      pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"

				      retry pip install -q numpy protobuf typing-extensions

				    else

				      pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				      retry pip install -q numpy protobuf typing-extensions

				    fi

				    pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				    retry pip install -q numpy protobuf typing-extensions

				  else

				    pip install "\$pkg"

				    retry pip install -q numpy protobuf typing-extensions

									
10 .circleci/scripts/binary_populate_env.sh
												
				@@ -71,14 +71,7 @@ export PYTORCH_BUILD_NUMBER=1

				# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS

				TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)

				# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT

				TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"

				# CUDA 12.9 builds have triton for Linux and Linux aarch64 binaries.

				if [[ "$DESIRED_CUDA" == "cu129" ]]; then

				  TRITON_CONSTRAINT="platform_system == 'Linux'"

				fi

				TRITON_CONSTRAINT="platform_system == 'Linux'"

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then

				  TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				@@ -134,7 +127,6 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"

				export DESIRED_CUDA="$DESIRED_CUDA"

				export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"

				export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"

				export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"

				if [[ "${OSTYPE}" == "msys" ]]; then

				  export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"

				  if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then

									
10 .circleci/scripts/binary_upload.sh
												
				@@ -23,10 +23,6 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then

				  AWS_S3_CP="aws s3 cp"

				fi

				if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"

				fi

				# this is special build with all dependencies packaged

				if [[ ${BUILD_NAME} == *-full* ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"

				@@ -55,16 +51,12 @@ s3_upload() {

				    s3_upload_dir="${s3_root_dir}/${UPLOAD_SUBFOLDER}/"

				  fi

				  (

				    cache_control_flag=""

				    if [[ "${UPLOAD_CHANNEL}" = "test" ]]; then

				      cache_control_flag="--cache-control='no-cache,no-store,must-revalidate'"

				    fi

				    for pkg in ${PKG_DIR}/*.${extension}; do

				      (

				        set -x

				        shm_id=$(sha256sum "${pkg}" | awk '{print $1}')

				        ${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}" \

				          --metadata "checksum-sha256=${shm_id}" ${cache_control_flag}

				          --metadata "checksum-sha256=${shm_id}"

				      )

				    done

				  )

									
3 .circleci/scripts/binary_windows_build.sh
												
				@@ -15,8 +15,7 @@ fi

				if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

				    export VC_YEAR=2022

				    export USE_SCCACHE=0

				    export XPU_VERSION=2025.1

				    export XPU_ENABLE_KINETO=1

				    export XPU_VERSION=2025.2

				fi

				echo "Free space on filesystem before build:"

Compare commits

1673 Commits pyobjectsl ... v2.9.0

15 .bc-linter.yml Normal file

26 .ci/aarch64_linux/aarch64_ci_build.sh

246 .ci/aarch64_linux/aarch64_wheel_ci_build.py

16 .ci/aarch64_linux/build_aarch64_wheel.py

4 .ci/docker/README.md

6 .ci/docker/almalinux/Dockerfile

162 .ci/docker/build.sh

2 .ci/docker/ci_commit_pins/huggingface-requirements.txt Normal file

1 .ci/docker/ci_commit_pins/huggingface.txt

1 .ci/docker/ci_commit_pins/nccl-cu13.txt Normal file

1 .ci/docker/ci_commit_pins/torchbench.txt Normal file

2 .ci/docker/ci_commit_pins/triton-xpu.txt

2 .ci/docker/ci_commit_pins/triton.txt

9 .ci/docker/common/install_cpython.sh

106 .ci/docker/common/install_cuda.sh

10 .ci/docker/common/install_cusparselt.sh

31 .ci/docker/common/install_inductor_benchmark_deps.sh

2 .ci/docker/common/install_nccl.sh

4 .ci/docker/common/install_onnx.sh

4 .ci/docker/common/install_triton.sh

8 .ci/docker/common/install_ucc.sh

61 .ci/docker/common/install_xpu.sh

9 .ci/docker/common/patch_libstdc.sh Executable file

13 .ci/docker/libtorch/Dockerfile

5 .ci/docker/manywheel/Dockerfile_2_28

2 .ci/docker/manywheel/Dockerfile_2_28_aarch64

2 .ci/docker/manywheel/Dockerfile_cuda_aarch64

6 .ci/docker/manywheel/build.sh

29 .ci/docker/requirements-ci.txt

2 .ci/docker/requirements-docs.txt

2 .ci/docker/triton_version.txt

2 .ci/docker/triton_xpu_version.txt

155 .ci/docker/ubuntu-cross-riscv/Dockerfile Normal file

5 .ci/docker/ubuntu-rocm/Dockerfile

4 .ci/docker/ubuntu-xpu/Dockerfile

7 .ci/docker/ubuntu/Dockerfile

2 .ci/libtorch/build.sh

31 .ci/lumen_cli/README.md Normal file

0 test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_blocked → .ci/lumen_cli/cli/build_cli/__init__.py

37 .ci/lumen_cli/cli/build_cli/register_build.py Normal file

0 test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_bool_called_at_least_once → .ci/lumen_cli/cli/lib/__init__.py

71 .ci/lumen_cli/cli/lib/common/cli_helper.py Normal file

42 .ci/lumen_cli/cli/lib/common/docker_helper.py Normal file

110 .ci/lumen_cli/cli/lib/common/envs_helper.py Normal file

143 .ci/lumen_cli/cli/lib/common/gh_summary.py Normal file

69 .ci/lumen_cli/cli/lib/common/git_helper.py Normal file

14 .ci/lumen_cli/cli/lib/common/logger.py Normal file

62 .ci/lumen_cli/cli/lib/common/path_helper.py Normal file

71 .ci/lumen_cli/cli/lib/common/pip_helper.py Normal file

139 .ci/lumen_cli/cli/lib/common/utils.py Normal file

292 .ci/lumen_cli/cli/lib/core/vllm/lib.py Normal file

285 .ci/lumen_cli/cli/lib/core/vllm/vllm_build.py Normal file

269 .ci/lumen_cli/cli/lib/core/vllm/vllm_test.py Normal file

40 .ci/lumen_cli/cli/run.py Normal file

0 test/dynamo_expected_failures/CPython313-test_bool-BoolTest.test_complex → .ci/lumen_cli/cli/test_cli/__init__.py

62 .ci/lumen_cli/cli/test_cli/register_test.py Normal file

23 .ci/lumen_cli/pyproject.toml Normal file

47 .ci/lumen_cli/tests/test_app.py Normal file

115 .ci/lumen_cli/tests/test_cli_helper.py Normal file

75 .ci/lumen_cli/tests/test_docker_helper.py Normal file

149 .ci/lumen_cli/tests/test_envs_helper.py Normal file

122 .ci/lumen_cli/tests/test_path_helper.py Normal file

185 .ci/lumen_cli/tests/test_run_plan.py Normal file

143 .ci/lumen_cli/tests/test_utils.py Normal file

176 .ci/lumen_cli/tests/test_vllm.py Normal file

7 .ci/magma/Makefile

2 .ci/magma/build_magma.sh

26 .ci/magma/package_files/cuda13.patch Normal file

4 .ci/manywheel/build.sh

33 .ci/manywheel/build_common.sh

104 .ci/manywheel/build_cuda.sh

2 .ci/manywheel/build_rocm.sh

1 .ci/manywheel/build_xpu.sh

60 .ci/pytorch/build.sh

23 .ci/pytorch/check_binary.sh

43 .ci/pytorch/common_utils.sh

2 .ci/pytorch/cpp_doc_push_script.sh

77 .ci/pytorch/macos-test.sh

1

.ci/pytorch/multigpu-test.sh

View File